METHOD FOR GENERATION OF SEQUENCE SAMPLED MAPS OF
COMPLEX GENOMES

The present invention relates to recombinant
DNA technology. More particularly, the invention
concerns a process for rapidly generating a physical
sequence map of large complex genomes, including human
chromosomes. The sequence mapping process ("sequence
sampled mapping") depends on the use of cosmid vectors
containing endogenous bacteriophage promoters to allow
for the sequencing of end-specific nucleotides of each
member of a contiguous library of cosmid clones.

Background of the Invention 

The complete analysis of large complex genomes,
such as genomes of higher eukaryotes, including human,
requires the extensive isolation, purification and
analysis of large fragments of DNA by cloning, generally
in E. coli. In the past, the lambda bacteriophage
cloning system has been used most frequently to generate
genomic libraries. The lambda bacteriophage vectors
usually accommodate inserts up to about 20 kb. Presently
the primary system used to clone and manipulate large DNA
fragments is that of cosmid vectors. Cosmid vectors
allow the packaging of DNA fragments of up to about 45 kb
in plasmids containing bacteriophage cos sites for in
vitro packaging.

The analysis of complex genomes involves the
application of both "top-down" and "bottom-up" mapping
strategies. The "top-down" strategy depends on the
separation on pulsed field gels of large DNA fragments
generated using rare restriction endonucleases for
physical linkage of DNA markers and the construction of
long-range maps [Schwartz et al., Cell 37:67 (1984);
Southern et al., Nucleic Acids Res. 15:5925 (1987); Burke
et al., Science 236:806 (1987)]. The "bottom-up"
strategy depends on identifying overlapping sequences in
a large number of randomly selected bacteriophage or
cosmid clones by unique restriction enzyme
"fingerprinting" and their assembly into overlapping sets
of clones. "Top-down" mapping is inherently more rapid
and less labor intensive, but does not generate sets of
DNA clones for further structural or biological analysis.
"Bottom-up" mapping generates the required sets of
overlapping clones but application of current strategies
and pattern matching algorithms to mammalian genomes will
require the analysis of thousands to tens of thousands of
individual clones for the generation of complete maps.

Clone-based physical maps have been extremely
useful as the framework for many types of structural and
biological studies and have been constructed for several
model organisms including E. coli, C. elegans, D.
melanogaster and S. cerevisiae (Kohara et al., 1987,
Cell, 50:495-508; Oliver et al., 1992, Nature, 357:38-46;
Sulston et al., 1992, Nature, 356:37-41; Merriam et al.,
1991, Science, 254:221-225). In the past few years, a
variety of techniques have been utilized for the
construction of ordered clone maps including cosmid and
phage contig-building (Olson et al., 1986, Proc. Natl.
Acad. Sci., USA, 83:7826-7830; Coulson et al., 1986,
Proc. Natl. Acad. Sci., USA, 83:7821-7825), analysis of
arrayed libraries (Evans and Lewis, 1989, Proc. Natl.
Acad. Sci., USA, 86:5030-5034), linking libraries and
pulsed-field gel analysis (Poustka and Lehrach, 1986,
Trends Genetics, 2:174-179; Hermanson et al., 1992,
Genomics, 13:134-143) and the assembly of YAC clone
contigs (Bellanne-Chantelot et al., 1992, Cell,
70:1059-1068; Foote et al., 1992, Science, 258:60-66).




Another approach for the assembly of clone maps
is the use of sequence tagged sites (STSs; Olson et al.,
1989, Science, 245:1434-1435): mapped DNA sequence
fragments that can be detected by amplification of
specific products using the polymerase chain reaction
(PCR; Saiki et al., 1988, Science, 239:487-491). STS
content mapping, the analysis of STS markers present in
large contiguous inserts in yeast artificial chromosomes,
has proven to be an efficient method for assembling clone
maps of 100 to 300 kb average resolution and has been
successful for the assembly of low resolution maps of
human chromosomes Y and 21 (Foote et al., 1992, supra;
Chumakov et al., 1992, Nature, 359:380-387). The
analysis of STS content in large DNA fragments carried in
somatic cell (Delattre et al., 1991, Genomics, 9:721-727)
or radiation-reduced cell hybrid lines (Cox et al., 1990,
Science, 250:245-250) also provides a powerful mapping
technique.

Large numbers of mapped STS markers have been
isolated for several human chromosomes (Tanigami et al.,
1992, Am J Hum Genet, 50:56-64; Hori et al., 1992,
Genomics, 13:129-133; Heding et al., 1992, Genomics,
13:89-94) but do not necessarily provide the needed
resources for large scale chromosome mapping. In many
cases, the value of these reagents is limited because the
probes are poorly characterized, not generally available
to the scientific community or cannot be used for
routine screening under a set of standardized conditions.
Thus, methods of producing DNA markers suitable as
reagents for large scale chromosome mapping are desired.

In addition, a major challenge of the human
genome project is development of new approaches for
physical analysis and sequence determination. Major
progress has been made with the sequencing of large
regions of DNA for significant portions of the E. coli
genome, chromosome III of S. cerevisiae, several cosmid
sized pieces of DNA from C. elegans, human, and a 100 kb
T cell receptor region from mouse. See, e.g., Daniels et
al., Science, 257:771-778 (1992); Martin-Gallardo et
al., Nat. Genet., 1:34-39 (1992); Oliver et al., Nature,
357:38-46 (1992); Sulston et al., Nature, 356:37-41
(1992); and Wilson et al., Genomics, 13:1198-1208 (1992).
The precise determination of each base for these
sequences, however, has been a labor intensive and costly
undertaking.

In addition, the construction of high
resolution physical maps and the acquisition of sequence
have previously been considered separate efforts. Thus,
new methods are desired, such as combining the steps of
physical mapping and sequencing, to sequence 25%-100% of
a particular portion of genomic DNA from megabase sized
regions to whole genomes or chromosomes more economically
than previous efforts and with reasonable accuracy.

Summary of the Invention 

The present invention relates to a rapid and
powerful sequence mapping method, called "sequence
sampled mapping", for sequencing complex genomes, said
method comprising sequencing the end-specific nucleotides
of each member of a library of cosmid clones, and
assembling a sequence sampled map by correlating the
end-specific sequence information with the relative spatial
relationship between the cosmids. The invention method
is applicable to genomic DNA, preferably mammalian
chromosomes, and in a preferred embodiment, employs a
"bottom-up" mapping strategy, which allows for the
simultaneous analysis of multiple cosmid clones for the
detection of overlaps. The sequence sampled mapping
method permits sequence overlaps to be determined by map
positions, reducing the reliance on determining regions
of unique shared sequence.

In a particular embodiment of the invention,
the method comprises, in any order, grouping of cosmid
clones by chromosome or YAC hybridization, construction
of high density cosmid contigs by restriction based
fingerprinting, and direct and automated DNA sequencing
from cosmid clones.

The sequence sampled mapping method is useful
for the completion of high density sequence-based maps,
and ultimately, for the complete sequencing of genomic
DNA directly from cosmid clones. In addition, the
resulting sequence information allows the detection of
many genes by sequence analysis with computer programs
such as FASTA, BLAST, GRAIL and others under development;
allows the development of sequence tagged sites (STSs) and
polymorphic repeats at an actual physical spacing of a
few kilobases (see, e.g., Olson et al., 1989, Science,
245:1434-1435); and allows direct PCR amplification of
any part of the genome, independent of clone libraries.
The invention method is also amenable to automation using
the particular characteristics of the sCOS vector and
cloning system. The resulting sequence sampled map is
also useful, employing on-line parallel processing
microcomputers which use existing software programs that
have been adapted for parallel processing, for the
computer analysis of genomic DNA.

Brief Description of Drawings

Figure 1 shows the vector sCOS-1 designed for
cosmid multiplex analysis. The vector contains
bacteriophage T3 and T7 promoters flanking a unique BamHI
cloning site, NotI sites for expedited restriction
mapping and excision of the insert DNA, duplicated cos
sites for high efficiency microcloning, a dominant
selection for transfection into mammalian cells, Amp and
Kn resistance genes, and ColE1 origin of replication.

Figure 2 illustrates the construction of cosmid
vector sCOS-1. Relevant restriction sites in the
precursor molecules are shown. ClaI-SalI and ClaI-XhoI
fragments were excised from pWE15 and pDVcos!43 and
purified on agarose gels. The indicated fragments were
joined using T4 DNA ligase and co-ligation of the XhoI
and SalI sites resulted in the loss of both sites in the
resulting plasmids.

Figure 3 depicts the DNA sequences of the
cloning site, bacteriophage promoters and flanking
restriction sites in sCOS vectors. Restriction sites and
T3 and T7 promoter sequences added using synthetic
oligonucleotides are shown. SfiI, NotI, EcoRI and SacII
restriction sites are indicated by thin lines. The
direction of transcription using T3 or T7 polymerase is
indicated by the arrows and the thick lines delineate the
critical nucleotides for promoter activity. The BamHI
site is the cloning site into which MboI-digested genomic
DNA is inserted. All linkers were inserted by
"linker-tailing" into the sites formed by digestion of
sCOS-1 with EcoRI.

Figure 4 illustrates a strategy useful for
analysis of physical linkage using groups of cosmids.
Figure 4A illustrates that cosmids prepared in vector
sCOS-1 or one of its derivatives can be used to
synthesize end-specific sequences (e.g., probes for the
detection of overlaps).

Figure 4B illustrates the inoculation of cosmid
clones on the surface of a nitrocellulose or nylon filter
from 96-well archive plates stored at -70° C. Each clone
on the "grid" is assigned a unique identifying Y and X
axis coordinate. Individual clones in the collection
contain the innate capacity of generating probes specific
for the extreme ends of the genomic DNA insert and
detecting overlapping clones on the filter. The arrows
show the locations of potential overlapping clones
detected by hybridization of probes generated from the
clone at coordinates Y = 2, X = 7.

Figures 4C and 4D illustrate the analysis of
multiple clones simultaneously. Cosmids are pooled
according to the rows and columns of the matrix, DNA is
prepared and a mixed RNA probe is synthesized. When
hybridized to the matrix filter, the probe detects a
pattern of spots consisting of all of the template clones
and the collection of clones overlapping with one end of
each of the template clones. A similar procedure is
carried out using cosmids pooled according to columns of
the matrix. When the two data sets are compared,
hybridizing clones identified by both of the mixed probes
may be overlapping with the template clone common to both
sets: that clone located at the intersection of the row
and column. This procedure may then be repeated using
other combinations of pooled probes and either T7 or T3
polymerase. The arrows denote the location of a clone
which overlaps with the "T7 end" of the clone at
coordinates Y = 2, X = 4.

Figure 5 shows predicted contigs from human
chromosome 11q and restriction enzyme digestion analysis.
Figure 5A presents the predicted linkage and orientation
of a representative cosmid contig generated by multiplex
analysis of the chromosome 11q cosmid set and data
analysis using the computer program "Contig-maker". The
computer output indicates the coordinates of linked
clones (X,Y) and the arrows denote the orientation of the
linkage.



Figure 5B presents a restriction map and the
location of probes used to establish unequivocal overlap
of the cosmids. A restriction map of the overlapping
clones detected in Figure 5A was determined by the
analysis of partial EcoRI digestion products hybridized
with 32P-labeled T3 or T7 promoter-specific
oligonucleotides. Overlapping areas not confirmed by
restriction map analysis were confirmed by hybridization
analysis using end-specific RNA probes generated from
individual cosmid clones. Cosmid clones c14,23 and
c19,27 are identical. □ indicates bacteriophage T3
promoter, ■ indicates bacteriophage T7 promoter.



Figure 6 presents physical and sequence maps of
the (A) β-giardin and (B) a second random genomic region
with arrows at the ends of cosmids corresponding to their
T7 ends. Smaller hash marks represent locally
regionalized, but not fully ordered restriction
fragments. Regions of sequence which match the genes
annotated on the figure are shown with a greyed-in box.

Figures 7A-F present alignment of various
regions of sequence homology found by BLAST searches of
the protein sequence databanks with cosmid end sequences.

Figure 8 presents a histogram showing the 
distances between ordered cosmid ends and their frequency 
of occurrence in the two contigs determined as described 
herein. 



Figure 9 presents a plot of the number of
sequenced cosmid ends versus the total amount of Giardia
lamblia genome sequenced. Calculations are based on the
equations of Lander and Waterman [Genomics, 2:231-239
(1988)] assuming an overlap detection of 50 bases.

Figure 10 presents regional mapping of a
chromosome 11 STS using a panel of somatic cell hybrids.
This analysis shows the regional mapping of an STS to bin
2, FLpter 0.05-0.24, by PCR analysis and is typical of
the bulk of STS primer results derived under these
conditions. The hybrid mapping panel breakpoints are
shown relative to human chromosome 11.

Figure 11 presents protein sequence alignments
of putative genes detected from analysis of DNA
sequences. DNA sequences determined from cosmid clones
were translated into six reading frames and used to
search GenPept, PIR or Swiss-Prot protein sequence
databases using BLASTX. The clone name from which the
sequence was derived is shown next to its translated
sequence. The flanking numbers indicate the matching
position in the nucleotide sequence of the cosmid or the
protein sequence of the database entry. The one-letter amino
acid code translation is shown with X = any amino acid
(generally caused by the inability to determine a base)
and * = a stop codon. The X in cSRL-7d2 has a one in four
chance of being a stop codon.

Detailed Description of the Invention 

In accordance with the present invention, there
is provided a method for sequencing complex genomes. The
invention method comprises:

(1) sequencing the end-specific nucleotides of
each member of a library of cosmid clones,
    wherein said cosmid clones are prepared by
    inserting genomic DNA fragments into
    cosmid vectors,
    wherein the cosmid vectors include sequences of
    nucleotides that flank at least one end of
    the inserted DNA, and that serve as
    transcription initiation sites for the
    synthesis of a nucleic acid specific to
    the ends of the inserted DNA, and

(2) assembling a sequence sampled map by
correlating the end-specific sequence information with
the relative spatial relationship between the cosmids.



In a preferred embodiment, the invention
sequence sampled mapping method provides for the
sequencing of the entire genome of any organism for which
genomic DNA is available, preferably mammalian genomic
DNA, more preferably human genomic DNA.



As used herein, the phrase "end-specific
nucleotides of each member of a library of cosmid clones"
refers to the nucleotide sequences at the extreme 5' and
3' ends of a given genomic DNA insert. Typically, the
number of nucleotides sequenced from each end-specific
nucleotide sequence will be at least 100, preferably 250,
more preferably 350, yet more preferably 550, with at
least 1000 nucleotides being especially preferred. The
number of sequenced nucleotides required for the practice
of the invention method varies as a function of the depth
of cosmids to be sequenced.

The phrase "depth of cosmids", and grammatical
variations thereof, refers to the number of overlapping
cosmids that contain, in common, a specified region
(i.e., 1 nucleotide) of genomic DNA to be mapped and
sequenced. For example, a 20X (20-fold) depth of cosmids
covering a specified region of genomic DNA refers to 20
cosmids that, on average, contain at least one nucleotide
of genomic DNA in common.

The depth of cosmids is chosen so as to
maximize the number of unique genomic DNA insert ends and
to provide a desired average spacing between the
respective 5' or 3' ends of two consecutive contiguous
cosmid clones. For example, if on average, each cosmid
contains approximately 40 kb of genomic DNA insert, then
a cosmid depth of 20X would produce, on average, a
spacing of 1000 nucleotide base pairs between the
respective 5' or 3' ends of each consecutive genomic DNA
insert. Thus, sequencing approximately 500 base pairs of
the 5' and 3' ends of all given cosmids will provide
genomic DNA sequence data for approximately 50% of the
respective genomic DNA sample.
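
The arithmetic in this example can be reproduced with a short sketch. It assumes, as the example appears to, that every insert contributes two ends and that those ends are evenly spaced along the region; the function names are illustrative and are not taken from the specification.

```python
def end_spacing_bp(insert_bp: float, depth: float) -> float:
    """Average distance between consecutive insert ends.

    Assumption: `depth` inserts cover every insert-length window and
    each insert contributes two ends (one 5', one 3').
    """
    return insert_bp / (2 * depth)


def sampled_fraction(insert_bp: float, depth: float, read_bp: float) -> float:
    """Fraction of the region sequenced when ends are evenly spaced."""
    return min(1.0, read_bp / end_spacing_bp(insert_bp, depth))


# Values from the example: 40 kb inserts, 20X depth, 500 bp per end.
spacing = end_spacing_bp(40_000, 20)
fraction = sampled_fraction(40_000, 20, 500)
print(spacing, fraction)
```

Under these assumptions the sketch returns a 1000 bp spacing and a sampled fraction of 0.5, matching the figures given above.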

The depth of cosmids can be varied by methods
well-known in the art. One method is to select any one
or combination of restriction enzymes that recognize a
specific genomic DNA sequence, preferably a 4-bp sequence
("4-bp-recognizing"). Next the restriction enzyme(s) are
employed to either partially or completely digest a given
genomic DNA sample so as to produce genomic DNA insert
fragments approximately 40-45 kb in length that are
unique with respect to other genomic DNA insert fragments
by as little as 100 base pairs to as great as 5-10 kb.
Restriction enzymes that recognize 4 bp sequences
suitable for use herein include, for example: Sau3A,
AccII, AluI, BSP50, FnuDII, HaeIII, HhaI, the
isoschizomers thereof, and the like.

A twenty-fold library generated by partial
digestions of genomic DNA with several different
restriction enzymes would greatly increase the number of
potential cloning sites and reduce the number of cloned
ends which are exactly the same. For example, multiple
5-10 fold deep libraries, from the same genomic DNA
source, can each be generated with a unique
four-bp-recognizing restriction enzyme. This would require
straightforward modifications to the COS vectors
described hereinafter, e.g., adding appropriate polylinker
sites. The uniquely restricted libraries can then be
combined to arrive at libraries with a substantially high
level of cosmid depth (e.g., at least about 20-fold deep,
preferably at least about 40-fold deep, and more
preferably at least about 50-fold deep). Thus, a 21-fold
library may be constructed from three sub-libraries of
7-fold cosmid depth, whereby each sub-library is produced
with a different four-bp-recognizing restriction enzyme.

One of skill in the art will recognize that by
varying any one or both of the above-described
parameters, sequencing of varying percentages of a given
genomic sample becomes feasible, such as at least 25%,
preferably 50%, more preferably 75%, with 100% (i.e., the
entire genome) being especially preferred (see, e.g.,
Figure 9). For example, in the above example, increasing
the depth of cosmids from 20X to 40X and sequencing 500
bp of each end of genomic DNA inserts provides the
sequence for 100% of the 40 kb genomic DNA sample.
Alternatively, in the above example, increasing the
number of nucleotides sequenced from 500 to 1000 bp also
provides the sequence for 100% of the 40 kb genomic DNA
sample. Stated another way, increasing the average
lengths of sequences determined, to about one kilobase,
would result in nearly complete one-pass sequencing of a
genome or chromosome at a fraction of the cost.
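
The percentages in the examples above treat insert ends as evenly spaced. As a complementary sketch, if the ends are instead assumed to fall at random positions, a standard Poisson argument of the kind used in the Lander and Waterman calculations cited for Figure 9 gives an expected covered fraction of 1 - exp(-c), where c is the total number of bases read divided by the length of the region. The function names below are illustrative, and the sketch ignores the minimum overlap needed to detect that two reads adjoin.

```python
import math


def read_redundancy(insert_bp: float, depth: float, read_bp: float) -> float:
    """Bases read per base of target, with two end reads per insert."""
    return 2 * depth * read_bp / insert_bp


def expected_fraction_random(redundancy: float) -> float:
    """Expected covered fraction when read positions are random."""
    return 1.0 - math.exp(-redundancy)


# Scenarios from the text: 40 kb inserts, depth and read length varied.
for depth, read_bp in [(20, 500), (40, 500), (20, 1000)]:
    c = read_redundancy(40_000, depth, read_bp)
    print(depth, read_bp, c, round(expected_fraction_random(c), 3))
```

The evenly spaced figures are therefore best read as an upper bound for a given amount of sequencing; with random ends, some positions are read more than once while others are missed.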

Either prior to, concurrently with, or following the
construction of contigs or the determination of the relative
spatial relationship between the cosmid clones described
above, the step of sequencing the end-specific nucleotides
is conducted to provide the
sequence information that will be assembled into the
"sequence sampled map".

The sequencing step may be carried out either
manually or using an automated DNA sequencer employing
well-known methods, such as, for example, specific or
degenerate primer extension, transposon primer insertion,
ordered deletion, random shotgun sequencing, sequencing
by hybridization, and the like. In a preferred
embodiment, the 5' and 3' ends of each cosmid clone
within a cosmid library are subjected to "one pass" (i.e.,
sequencing only once) automated DNA sequencing as
described in Examples 3 and 4. Automated DNA sequencing
devices are well-known and widely available to those of
skill in the art, such as, for example, the sequencing
devices available from Applied Biosystems (e.g., ABI 373A
Sequencer combined with a Catalyst 800 robot, Foster
City, CA), Pharmacia (Piscataway, NJ), Millipore
(Milford, MA), and the like.

It is recognized that automated DNA sequencing
technology is currently progressing extremely rapidly.
Thus, automated sequencing methods and devices that will
allow sequencing of DNA fragments greater than 500
nucleotides (i.e., 1 to 5 kb) are also contemplated in
the methods described herein (see, e.g., Ansorge et al.,
1992, Electrophoresis, 13:616-619). Subsequently, when
correction and verification of sequence information is
desired for a particular region, an independent
sequencing methodology may be employed, e.g., sequencing
by hybridization, and the like.

Raw sequence information obtained from
automated sequencing, or sequence sampled mapped
sequence, can be analyzed immediately using on-line
parallel processing microcomputers that employ existing
software programs adapted for parallel processing.
Sequence analysis software programs contemplated for use
herein include, for example: GRAIL, which locates
protein-coding regions in genomic DNA sequences (see,
e.g., Uberbacher et al., PNAS, USA, 88:11261-11265,
1991); BLAST-n and BLAST-x, which compare sequence
similarity between nucleotide and amino acid sequences,
respectively (see, e.g., Altschul et al., J. Mol. Biol.,
215:403-410, 1990); and FASTA, which identifies sequence
repeats (see, e.g., Pearson et al., PNAS, USA,
85:2444-2448, 1988). Raw sequence information may also be used
advantageously to generate PCR primers useful in PCR
assays for polymorphic repeats around specific sequences
of interest, e.g., around "CA" nucleotide runs, other
simple sequences, and the like.
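
As an illustration of how "CA" nucleotide runs might be located in a raw end sequence before primers are chosen, the following minimal sketch scans one strand with a regular expression and returns the flanking sequence from which primers could be picked. The threshold of ten repeats, the flank length and the function names are assumptions made for the example; they are not parameters stated in the specification.

```python
import re


def find_ca_runs(sequence: str, min_repeats: int = 10):
    """Return (start, end, repeat_count) for each run of CA dinucleotides.

    Only the given strand is scanned; a GT run on this strand (a CA run
    on the complementary strand) would need a second pass.
    """
    pattern = re.compile(r"(?:CA){%d,}" % min_repeats)
    runs = []
    for match in pattern.finditer(sequence.upper()):
        length = match.end() - match.start()
        runs.append((match.start(), match.end(), length // 2))
    return runs


def flanks(sequence: str, start: int, end: int, flank_bp: int = 20):
    """Sequence on either side of a run, from which primers could be chosen."""
    return sequence[max(0, start - flank_bp):start], sequence[end:end + flank_bp]


# Hypothetical end read containing a run of twelve CA repeats.
read = "GATTACAGGT" + "CA" * 12 + "TTGACCGTAA"
for start, end, count in find_ca_runs(read):
    print(start, end, count, flanks(read, start, end, flank_bp=5))
```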

In another aspect of the invention, prior to
the completion of a complete physical map, raw sequence
information can be used to generate "sequence-tagged
sites" (STSs) as described in Example 4. The STSs can be
used, e.g., for producing an ordered set of YACs, for the
analysis of sites of chromosomal pathology (e.g.,
translocations, polymorphic repeats, and inversions), and
the like. The production of STSs allows access to
mapping markers based upon PCR amplification of known
genomic sequence.

Briefly, as described above, DNA sequences are
determined by sequencing directly from cosmid templates
using primers complementary to the promoters (e.g., T3
and T7) present in the cloning vector; oligonucleotide
PCR primers are predicted by computer from a suitable
amount of randomly selected cosmid-end-derived sequences,
and are tested using a battery of genomic DNA templates,
preferably corresponding to a specific chromosome.
Cosmids are then regionally localized to the respective
chromosome using fluorescence in situ hybridization
and/or by the analysis of a somatic cell hybrid panel.
Additional STSs corresponding to known genes and genetic
markers on the respective chromosome may also be produced
under the same series of standardized conditions.

As used herein, a "suitable amount" of STSs
produced can be varied by one of skill in the art to
provide a desired coverage (preferably uniform) of a
respective chromosome. For example, it is well within
the skill in the art to select an amount of STSs that
provides an average spacing between consecutive STSs
within the range of about 1 kb up to about 500 kb, and
also provides sufficient density for STS content mapping
using YAC clones or contigs (see, e.g.,
Bellanne-Chantelot et al., 1992, Cell, 70:1059-1068).
Thus, e.g., assuming a chromosomal size of 126 Mb (e.g.,
based upon chromosome 11 comprising 4.2% of the human
genome), a collection of 370 STSs will have an average
spacing of one STS per 340 kb and would provide
sufficient density for STS content mapping using YAC
clones. In a preferred embodiment, the quantity of STSs
produced by the methods described herein provides an
average spacing between consecutive STSs within the
range of about 1 kb up to about 5 kb.
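
The spacing figures above follow from simple division, sketched below. The 3000 Mb human genome size is an assumption chosen because it is the value implied by the 126 Mb and 4.2% figures in the text; the function names are illustrative.

```python
import math


def chromosome_size_mb(genome_mb: float, fraction: float) -> float:
    """Chromosome size implied by its share of the genome."""
    return genome_mb * fraction


def average_spacing_kb(chromosome_mb: float, n_markers: int) -> float:
    """Average distance between markers spread over the chromosome."""
    return chromosome_mb * 1000 / n_markers


def markers_needed(chromosome_mb: float, spacing_kb: float) -> int:
    """Number of markers required to reach a target average spacing."""
    return math.ceil(chromosome_mb * 1000 / spacing_kb)


size_mb = chromosome_size_mb(3000, 0.042)   # assumed 3000 Mb genome
print(round(size_mb), round(average_spacing_kb(126, 370), 1))
print(markers_needed(126, 5), markers_needed(126, 1))
```

On these numbers, the preferred 1 kb to 5 kb spacing corresponds to roughly 25,000 to 126,000 STSs for a 126 Mb chromosome, compared with 370 for the YAC-scale density.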

As used herein, the phrase "assembling a
sequence sampled map" refers to the step of ordering the
nucleotide sequences obtained from the sequencing step
into the order in which they naturally occur in the
source genome. This is accomplished by correlating the
end-specific sequence information obtained in step (1)
above with the relative spatial relationship between the
cosmids.
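
The correlation step can be pictured as placing each end read at the map coordinate of the insert end it came from and then sorting. The following is a minimal sketch under stated assumptions: each cosmid already has a mapped left edge, an insert length and a known orientation, and every read has the same length. The record layout and names are hypothetical and are not those of any program named in the specification.

```python
from typing import NamedTuple


class Cosmid(NamedTuple):
    name: str
    left_bp: int        # map position of the insert's left edge
    insert_bp: int      # insert length
    t7_on_left: bool    # orientation: is the T7 promoter at the left edge?


class EndRead(NamedTuple):
    clone: str
    promoter: str
    start_bp: int
    end_bp: int


def place_end_reads(cosmids, read_bp):
    """Assign map coordinates to both end reads of each cosmid, in map order."""
    reads = []
    for c in cosmids:
        right = c.left_bp + c.insert_bp
        left_name, right_name = ("T7", "T3") if c.t7_on_left else ("T3", "T7")
        reads.append(EndRead(c.name, left_name, c.left_bp, c.left_bp + read_bp))
        reads.append(EndRead(c.name, right_name, right - read_bp, right))
    return sorted(reads, key=lambda r: r.start_bp)


# Two hypothetical overlapping cosmids in opposite orientations.
contig = [Cosmid("A", 0, 40_000, True), Cosmid("B", 2_000, 40_000, False)]
for read in place_end_reads(contig, read_bp=500):
    print(read)
```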

Once the sequence sampled map has been
determined, a minimum tiling path of cosmids (just enough
to cover the region once) can be used as sequencing
templates. Each sampled sequence can then be extended in
both directions to triple the effective sequence reads.
The availability of inexpensive oligonucleotide primers
will make this sequence walking an attractive option for
finishing the sequence. One attractive approach which
may make primer walking affordable is the use of
contiguous hexamer oligonucleotides to specifically prime
sequencing reactions (see, e.g., Kieleczawa et al.,
1992, Science, 258:1787-91).
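
One common way to pick a minimum tiling path from mapped clones is a greedy interval cover: starting from the left edge of the region, repeatedly choose the clone that begins at or before the covered point and reaches furthest to the right. The sketch below illustrates that general technique; it is not a procedure stated in the specification, and the clone names and coordinates (in kb) are hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: iterable of (name, start, end) map intervals.
    Returns the clone names of a smallest set covering the region,
    or raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Consider every clone that starts at or before the covered point.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in coverage at {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


clones = [("c1", 0, 40), ("c2", 5, 45), ("c3", 30, 70),
          ("c4", 38, 78), ("c5", 60, 100), ("c6", 75, 115)]
print(minimum_tiling_path(clones, 0, 110))
```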

The term "relative spatial relationship between
the cosmids" refers to the physical mapping of the
genomic DNA inserts into the contiguous order and/or the
respective orientations (e.g., 5'-3' or 3'-5') in which
they naturally occur in the source genome. The relative
spatial relationship may be determined either prior to,
concurrently with, or after the step of "sequencing the
end-specific sequences of each member of a library of
cosmid clones." In a preferred embodiment, the relative
spatial relationship between the cosmids is determined
concurrently (i.e., in parallel) with the sequencing of
step (1) above.

The relative spatial relationship between the
cosmids (i.e., physical map) can be determined by methods
well-known in the art, such as fingerprinting by
restriction enzyme mapping ("restriction-fragment-length
mapping") employing several different restriction enzymes
[see, e.g., Olson et al., Proc. Natl. Acad. Sci. USA
83:7826 (1986); Coulson et al., Proc. Natl. Acad. Sci.
USA 83:7821 (1986); Kohara et al., Cell 50:495 (1987);
and the like]; the "cosmid multiplex analysis" method as
described in U.S. Patent No. 5,219,726, incorporated
herein by reference in its entirety; and the like.

Restriction fragment length mapping employs
full and/or partial restriction enzyme digests. The
partial digestion provides the order of restriction sites
within a cloned genomic insert, and the full digestion
provides the exact sizes of the restriction fragments.
Partial digestion products are detected with non-insert
probes specific to either side of the cosmid vector,
which orients the sequence relative to the genomic map.
The partial digestion restriction maps are reconciled
with data from the full restriction digest to obtain a
spatially correct physical map. The restriction enzyme
maps of each cosmid are compiled into an overall map of
the genomic region of interest, which could be derived
from a YAC (Yeast Artificial Chromosome), BAC (Bacterial
Artificial Chromosome), PAC (P1 Artificial Chromosome),
MAC (Mammalian Artificial Chromosome), a whole
chromosome, or a small whole genome.
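
The reconciliation of partial and full digests can be sketched as follows. When partial digestion products are detected with a probe specific to one end, each band length is the distance from that end to a cut site (the longest band being the uncut insert), so the sorted band lengths give the site order and their differences give the fragment sizes, which can then be checked against the full digest. This is a simplified illustration: it ignores the vector sequence between the probe and the insert, and the tolerance, band sizes and function names are assumptions.

```python
def map_from_partial_digest(partial_sizes_kb):
    """Site positions and inter-site fragment sizes from end-probed bands."""
    positions = sorted(partial_sizes_kb)
    fragments = [positions[0]]
    fragments += [b - a for a, b in zip(positions, positions[1:])]
    return positions, fragments


def consistent_with_full_digest(fragments, full_sizes_kb, tolerance_kb=0.2):
    """Check that ordered fragments match the full-digest sizes as a set."""
    if len(fragments) != len(full_sizes_kb):
        return False
    pairs = zip(sorted(fragments), sorted(full_sizes_kb))
    return all(abs(a - b) <= tolerance_kb for a, b in pairs)


# Hypothetical bands (kb) seen with a probe for one end of a 41 kb insert.
partial_bands = [17.0, 4.0, 41.0, 9.5, 30.0]
full_digest = [13.1, 4.0, 11.0, 5.4, 7.6]

positions, fragments = map_from_partial_digest(partial_bands)
print(positions, fragments)
print(consistent_with_full_digest(fragments, full_digest))
```

In this example the ordered fragments agree with the full digest within the tolerance, so the full-digest sizes could be substituted for the less precise differences while keeping the order from the partial digest.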

In another embodiment of the present invention,
the relative spatial relationship between the cosmids is
determined by the "cosmid multiplex analysis" method as
described in Example 2 and U.S. Patent No. 5,219,726.
The cosmid multiplex analysis method depends on the use
of cosmid vectors allowing for the synthesis of
corresponding RNA sequences (probes) or DNA sequences
specific to the extreme ends of the DNA fragments
contained within the cosmid, directly from the DNA
inserts of the cosmid.

Briefly, cosmid libraries are constructed using
vectors containing at least one bacteriophage promoter
adjacent to the genomic DNA insert, positioned
operatively for the transcription thereof. Preferably,
the cosmid vectors contain two bacteriophage promoters
flanking the DNA fragment ligated into the insertion
site. Synthesis of an end-specific RNA probe from any
clone in the collection allows the overlapping clones to
be easily detected by hybridization. Because this
strategy does not depend on pattern recognition for
detecting overlaps, analysis may be carried out
simultaneously on cosmid clones grouped together.
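
The row and column pooling described for Figures 4C and 4D lends itself to a set intersection: clones that hybridize with both the row pool and the column pool, other than the template at their intersection, are candidate overlaps for that template. The sketch below simulates this on a hypothetical grid; the overlap table and coordinates are invented for illustration only.

```python
def pooled_hits(overlaps, templates):
    """Spots lit by a mixed probe: the template clones plus their overlaps."""
    hits = set(templates)
    for clone in templates:
        hits |= overlaps.get(clone, set())
    return hits


def candidates_for(row_hits, column_hits, template):
    """Clones seen with both pools, excluding the shared template itself."""
    return (row_hits & column_hits) - {template}


# Hypothetical 3 x 4 grid with (y, x) coordinates and a few true overlaps.
overlaps = {(2, 4): {(1, 1)}, (2, 1): {(3, 3)}, (1, 4): {(3, 2)}}
row_2 = [(2, x) for x in range(1, 5)]
column_4 = [(y, 4) for y in range(1, 4)]

row_hits = pooled_hits(overlaps, row_2)
column_hits = pooled_hits(overlaps, column_4)
print(candidates_for(row_hits, column_hits, (2, 4)))
```

The result is a set of candidates rather than confirmed overlaps, since a clone could in principle overlap one template in the row pool and a different template in the column pool; repeating the analysis with other pool combinations, as the description of Figures 4C and 4D notes, narrows this down.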

The "cosmid multiplex analysis" method is
suitable for the unambiguous detection of overlapping
regions as small as several hundred nucleotides in
contiguous cosmids. Accordingly, the number of clones
needed for map closure can be reduced by up to
three-fold. Finally, this strategy represents
essentially simultaneous cosmid "walking" and thus is
basically non-random, allowing the investigator the
freedom to pause and investigate some interesting biology
rather than requiring completion of the map before it
becomes useful.

It has been found that significant improvements
in the speed and efficiency of "bottom-up" genomic
mapping can be achieved by 1) isolating restricted
regions of large mammalian genomes in a "sublibrary"
preorganized on a solid matrix, 2) using hybridization of
end-specific probes for detection of overlapping clones
in the collection, and 3) analyzing multiple clones
simultaneously for the detection of all overlaps in the
collection.

In accordance with the present invention, 
direct sequencing can then be carried out on the fragment 
ends of the individual cosmids which make up the 
resulting contig. This will generate in the range of 
about 350-550 base pairs of sequence information, 
separated by gaps of about 1-5 kb, depending on the 
chosen depth of cosmids employed. Thus, for a cosmid 
depth of 10-20X, in the range of about 20-50% of the 
complete chromosome sequence can be obtained very rapidly 
and at relatively low cost. 
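
The relationship between cosmid depth, the spacing of end reads, and the fraction of sequence sampled can be checked with simple arithmetic. The following is only an illustrative sketch: it assumes an average insert of 40 kb (a hypothetical value within the cosmid capacity mentioned earlier), a 450 bp read from each insert end, and evenly spaced, non-overlapping reads. The function name and default values are not part of the described method.

```python
def sampled_map_estimate(depth, insert_bp=40_000, read_bp=450):
    """Estimate end-read spacing and the fraction of sequence sampled.

    depth     -- cosmid redundancy (e.g. 10 for 10X)
    insert_bp -- assumed average insert size (hypothetical value)
    read_bp   -- assumed read length obtained from each insert end
    """
    # Each clone contributes two ends, so ends occur on average every
    # insert_bp / (2 * depth) base pairs along the chromosome.
    spacing_bp = insert_bp / (2 * depth)
    # Idealized fraction: reads evenly spaced and never overlapping.
    fraction = min(1.0, read_bp / spacing_bp)
    gap_bp = max(0.0, spacing_bp - read_bp)
    return spacing_bp, gap_bp, fraction


for depth in (4, 10, 20):
    spacing, gap, frac = sampled_map_estimate(depth)
    print(f"{depth}X: one end read every {spacing:.0f} bp, "
          f"gap {gap:.0f} bp, {frac:.1%} sampled")
```

Under these assumptions, depths of 10X and 20X give about 22.5% and 45% of the sequence, which falls within the 20-50% range stated above. Real clone ends are not evenly spaced, so actual gaps will vary around these averages.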



Thus, an alternate embodiment of the present 
invention provides a method for sequencing complex 
genomes, said method comprising: 

(1) preparing a genomic library of cosmid 
clones by inserting DNA fragments from said genome into 
cosmid vectors, wherein the cosmid vectors include 
sequences of nucleotides that flank at least one end of 
the inserted DNA, and that serve as transcription 
initiation sites for the synthesis of end-specific 
probes, 

(2) arranging the cosmid clones, whereby each 
clone may be identified and replicas of said arrangement 
may be reproduced, 

(3) pooling portions of said cosmid clones and 
synthesizing pools of mixed end-specific probes from the 
DNA inserts that have been prepared from said pooled 
clones, wherein each pool contains fewer than all of the 
cosmid clones in the library, but all of the cosmid 
clones in the library are included in at least one pool, 

(4) hybridizing each pool of probes to a 
replica of said arranged cosmid clones and identifying 
the cosmid clones in each replica that hybridize to the 
probes, wherein said identified clones include the pooled 
cosmid clones and cosmid clones that contain DNA inserts 
that overlap with the DNA inserts in the pooled clones, 

(5) identifying the cosmid clones from among 
those identified in step (4) that hybridize to two or 
more pools of probes, thereby identifying groups of 
cosmid clones that include overlapping DNA, 

(6) assembling contigs from said groups, and 

(7) sequencing the end-specific nucleotides of 
each overlapping member of said cosmid clones. 

In a preferred embodiment, the 
cross-hybridizing clones are identified by pairwise 
comparison of data sets obtained from two groups of 
cosmid clones containing at least one common clone. The 
cosmid clones are preferably pooled according to the rows 
and columns of a two-dimensional matrix. 
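
Pooling by rows and columns can be sketched as follows. This is a minimal illustration, assuming clones are identified by (row, column) coordinates numbered from 1; the function name and pool labels are hypothetical. It shows that each pool holds fewer than all of the clones while every clone falls in at least one pool, as step (3) requires.

```python
def row_and_column_pools(n_rows, n_cols):
    """Return {pool_name: [(row, col), ...]} for a two-dimensional matrix."""
    pools = {}
    for r in range(1, n_rows + 1):
        pools[f"row-{r}"] = [(r, c) for c in range(1, n_cols + 1)]
    for c in range(1, n_cols + 1):
        pools[f"col-{c}"] = [(r, c) for r in range(1, n_rows + 1)]
    return pools


pools = row_and_column_pools(36, 36)
all_clones = {(r, c) for r in range(1, 37) for c in range(1, 37)}
pooled = {clone for members in pools.values() for clone in members}

print(len(pools), "pools of", len(pools["row-1"]), "clones each")
print("every clone pooled:", pooled == all_clones)
```

In this arrangement each clone belongs to exactly two pools (its row and its column), and any two pools of different kinds share exactly one common clone, which is what makes the pairwise comparison of data sets possible.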

Preferably, the cosmid vectors used in the 
above processes comprise two oppositely oriented 
promoters, each of which is specific for a bacteriophage 
RNA polymerase, positioned on two sides of the cloning 
site. Most preferably, the vectors contain T3 and T7 
endogenous bacteriophage promoters flanking the cloned 
genomic DNA. Vectors containing at least two cos sites 
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restriction sites for cloning. Plasmids with a large 
variety of cloning sites and prokaryotic and eukaryotic 
selection markers can be converted to cosmids by 
insertion of the lambda cos region. 

The term "cosmid clone" refers to a cosmid 
vector that contains a genomic DNA insert. The term 
"plasmid" refers to circular, double-stranded DNA loops 
which, in their vector form, are not bound to the 
chromosome. The term "nucleic acid" refers to a 
synthetic or naturally occurring DNA or RNA molecule. 

As used herein, the term "a promoter specific 
for a bacteriophage RNA polymerase" means a wild-type or 
non-wild-type promoter that can be used by the 
bacteriophage RNA polymerase for in vitro transcription 
of a DNA fragment. When a non-wild-type promoter is used 
for such in vitro transcription of a DNA fragment, 
transcription will occur at a rate which is at least 10% 
of the rate at which transcription would have occurred if 
a wild-type or native promoter had been used by the 
bacteriophage RNA polymerase to transcribe the DNA 
fragment in vitro. 

The term "cloning site", as used herein, means a 
restriction endonuclease site on the DNA sequence of the 
cosmid vectors of the present invention where a DNA 
fragment can be inserted without deleting any of the 
original DNA. 

Reference to a promoter positioned "operatively 
for transcription of a DNA fragment", as used herein, 
means that the promoter will be positioned in such a way 
that any DNA sequences between the promoter's 
transcriptional start site and the DNA fragment will not 
prevent transcription of at least a portion of the DNA 
fragment by the promoter. The term "at least a portion" 
means that preferably at least 8 base pairs and more 
preferably at least about 30 bp of the DNA fragment will 
be transcribed. 

The terms "end-specific RNA sequences", "RNA 
probes", and grammatical variations thereof, are used to 
refer to hybridization probes obtained by transcription 
of corresponding DNA fragments. 



Clones are overlapping if they contain 
contiguous DNA in the same relationship as that in the 
genome. One method for detecting overlaps is to 
synthesize an RNA probe from one end of a first clone. If 
this probe detectably hybridizes with an end of the 
second clone under standard hybridization conditions, the 
two clones are overlapping [Wahl et al., PNAS USA 84:2160 
(1987)]. 

The term "contig" was introduced by Rodger 
Staden, Nucleic Acids Res. 8:3673 (1980) in connection 
with DNA sequence analysis, and refers to groups of 
clones with contiguous nucleotide sequences. 

The cosmid multiplex analysis method employs 
essentially the strategy illustrated in Figure 4 used for 
genomic mapping using cosmid vectors. 

In a first step, a genomic library which 
represents a limited portion of a genome is constructed 
in a cosmid vector allowing for the synthesis of RNA 
probes and DNA strands for sequencing directly from 
insert DNA using endogenous bacteriophage promoters. A 
convenient and powerful way of subdividing the human 
genome for the preparation of libraries is through 
chromosome purification by flow cytometry [Gray et al., 
Cold Spring Harbor Symp. LI, 1986, p. 141]. 

Cosmid vectors suitable for constructing a 
genomic "sub-library" include the pWE vectors described 
by Wahl et al., in Proc. Natl. Acad. Sci. USA 84:2160 
(1987), e.g. pWE2, pWE4, pWE8, pWE10, pWE15, and pWE16, 
preferably pWE15 and pWE16. The construction of these 
vectors is described in the Materials and Methods section 
of the cited article and in Evans et al., Methods in 
Enzymology 152:604 (1987). These vectors, in addition to 
replication and selection functions, such as plasmid 
origin of replication, bacterial genes specifying 
antibiotic resistance, and the bacteriophage lambda 
cohesive termini (cos sequences), contain the 
transcription promoters from either bacteriophage SP6, T7 
or T3 flanking a unique BamHI cloning site. 

In one embodiment, cosmid vectors containing a 
duplicated cos sequence are employed. These "sCOS" 
vectors have the following important characteristics: 1) 
the presence of two cos sites such that packaging could 
be carried out with high efficiency and without requiring 
size selection of the insert DNA; 2) the presence of T3 
and T7 bacteriophage promoters for the synthesis of 
"walking" probes; 3) unique restriction sites for 
removing the insert and to aid in restriction mapping; 4) 
selectable genes for gene transfer in eukaryotic cells; 
and 5) a plasmid origin of replication giving a high 
yield of cosmid DNA for preparing templates. 

The construction of plasmid sCOS-1 is described 
in the Examples, and is illustrated in Figures 1 and 2. 
The design of this plasmid (and derivatives thereof) 
allows for rapid production of RNA probes specific for 
both ends of the inserted DNA sequences. In addition, 
this design allows for the automated sequencing of the 5' 
and 3' ends of the genomic DNA insert. 

Plasmid vector sCOS-1, shown in Figure 1, is 
6.7 kb in size and has a cloning capacity of 31 to 48 kb. 
As with pWE vectors, bacteriophage T3 and T7 promoters 
were oriented into the BamHI cloning site to allow direct 
synthesis of end-specific RNA probes for molecular 
"walking" and to allow sequencing the end-specific 
nucleotides of a given genomic DNA insert. Previous 
experience with pWE cosmids suggested that NotI 
restriction sites may not be ideal for excision or 
mapping of inserts in some regions of the genome where 
NotI sites might be clustered. Therefore, additional 
cosmid vectors with other rare restriction sites have 
been constructed by substituting the cloning/polymerase 
sites of sCOS-1 with sequences containing NotI and SacII 
sites (sCOS-2) or SfiI sites (sCOS-4). The asymmetric 
rare sites in sCOS-2 are useful for cloning ends of large 
NotI or SacII fragments for isolation of "linking" clones 
for long range mapping by pulsed field gel analysis 
[Buiting et al., Genomics 2:143 (1988)]. Also, vectors 
which lack NotI sites, such as sCOS-4, would potentially 
allow the selection of clones containing unique NotI 
junction fragments by hybridization with NotI-specific 
oligonucleotides [Estivill et al., Nucleic Acids Res. 
15:1415 (1987)]. 

The double cos site sCOS vectors make feasible 
the preparation of representative libraries from very 
small amounts of purified, partially digested DNA, and 
are, therefore, presently preferred for carrying out the 
method of the present invention. 

One of skill in the art can modify the sCOS 
vectors employed herein in a variety of ways to make them 
more suitable for restriction-fragment-length mapping 
using partial digestion (see, e.g., Kohara et al., Cell 
50:495-508 (1987)). For example, an "sCOS-derivative" 
vector useful in the invention methods described herein 
can include intron-encoded endonuclease sites which 
recognize approximately 20 base pairs, and a multiple 
cloning site (i.e., polylinker) which will ligate with 
partially-digested genomic DNA from most commercially 
available restriction enzymes that recognize four bases 
and generate staggered ends. 

For determining the relative spatial 
relationship between the cosmid clones, the individual 
clones of the genomic library are arranged on a 
nitrocellulose or nylon filter matrix and each clone is 
identified by unique coordinates. If the randomly chosen 
clones are arranged in a two-dimensional matrix, they are 
identified by unique X and Y coordinates. For 
convenience in handling, the pattern of the matrix is 
preferably based on the pattern and spacing of wells of a 
standard 96-well microtitre plate and the repetitive 
preparation of culture plates and hybridization filters 
may be carried out using equipment designed for working 
with this standard. Each individual cosmid clone in the 
collection possesses the innate mechanism of generating 
an RNA probe capable of detecting any overlapping or 
identical clones in the collection. 

If an RNA probe is generated using T3 or T7 
polymerase, and overlapping clones are detected by 
hybridization of the probe to a replica of the filter 
grid, using cosmid clones arranged on a 36 x 36 matrix 
containing 1296 clones, all of the overlaps can be 
detected by carrying out 1296 T3 polymerase reactions, 
1296 T7 reactions and subsequent hybridization reactions. 

However, as an alternative to the individual 
analysis of cosmid clones for the detection of overlaps, 
simultaneous analysis of multiple cosmid clones in groups 
can be conducted, as described in U.S. Patent No. 
5,219,726. Accordingly, this preferred strategy allows 
the analysis of a collection of cosmid clones with far 
less effort than "fingerprinting" each of the clones 
selected individually. 

The linked clones detected by the above method 
can then be grouped into contigs, either manually or, 
preferably, using appropriate computer programs. To 
confirm the correctness of the groupings, some of the 
contigs can be subjected to detailed restriction enzyme 
analysis, and the degree of physical overlap along with a 
physical map can be determined. To complete a physical 
genomic map, the above-outlined procedure can be repeated 
with as many clones as necessary, and the gaps between 
the contigs can be filled in, e.g. by traditional 
chromosome walking. 

The method described hereinabove represents a 
special case of a more general sequence mapping strategy 
based on clone matrices of higher order, such as, for 
example, greater than or equal to 3 dimensions (as 
described in U.S. Patent No. 5,219,726). 

Table 1 

Theoretical analysis of genomes of various sizes by 
cosmid multiplex analysis where clones are organized in 
matrices of various dimensions and clones analyzed using 
probes prepared from groups of cosmid clones. 

dimension*   matrix size   probe pool size   number of clones analyzed   number of mixed probes 

* Where a matrix of n dimensions containing dⁿ 
cosmids is used for the arrangement of cosmid 
clones, pooled probes of dⁿ⁻¹ members are 
prepared which allow for the analysis of dⁿ 
individual clones by multiplex analysis using 
n x d mixed probes. For example, using a 
three dimensional matrix of 10 x 10 x 10 members, 
pooled probes of 100 individual clones could 
be used to analyze 1000 individual cosmids 
with 30 analytical reactions. 
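
The relations in the footnote (d to the power n clones, pools of d to the power n-1 members, and n x d mixed probes) can be tabulated for any matrix. The sketch below is illustrative only; the function name and dictionary keys are hypothetical labels for the footnote's quantities.

```python
def multiplex_parameters(n, d):
    """Footnote relations for an n-dimensional matrix with d members per axis."""
    return {
        "clones_analyzed": d ** n,        # total clones in the matrix
        "probe_pool_size": d ** (n - 1),  # clones contributing to each mixed probe
        "mixed_probes": n * d,            # number of pooled probes needed
    }


for n, d in [(2, 36), (3, 10), (5, 13)]:
    print(f"n={n}, d={d}:", multiplex_parameters(n, d))
```

The three cases correspond to the 36 x 36 matrix (1296 clones, 72 mixed probes), the three-dimensional example of the footnote (1000 clones, 30 mixed probes), and a five-dimensional matrix with 13 members per axis (371,293 clones, 65 mixed probes, with 28,561 clones per pool).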

Provided that the theoretical limit of detectable 
signal from pooled RNA probes is not reached, cosmids 
representing the entire human genome with four-fold 
redundancy can potentially be analyzed by the "multiplex" 
method using a 5-dimensional matrix of d=13 in only 65 
hybridization reactions. The theoretical limitation of 
this strategy seems to be the number of individual clones 
which can be pooled and still give a reproducible 
hybridization signal. Current protocols suggest that 
this limitation may be somewhat greater than 100 clones. 

Further details of the invention are 
illustrated by the following, non-limiting examples. 

Examples 

Unless otherwise stated, the present invention 
was performed using standard procedures, as described, 
for example, in Maniatis et al., Molecular Cloning: A 
Laboratory Manual, Cold Spring Harbor Laboratory Press, 
Cold Spring Harbor, New York, USA (1982); Sambrook et 
al., Molecular Cloning: A Laboratory Manual (2nd ed.), Cold 
Spring Harbor Laboratory Press, Cold Spring Harbor, New 
York, USA (1989); Davis et al., Basic Methods in 
Molecular Biology, Elsevier Science Publishing, Inc., New 
York, USA (1986); or Methods in Enzymology: Guide to 
Molecular Cloning Techniques, Vol. 152, S. L. Berger and A. 
R. Kimmel, Eds., Academic Press Inc., San Diego, USA 
(1987). 

Cosmid vectors 

Genomic libraries were constructed in cosmid 
vector sCOS-1 (illustrated in Figure 1). sCOS-1 was 
prepared from cosmid vectors pWE15 [see Evans et al., 
Methods in Enzymology 152:604 (1987); ATCC Accession No. 
37503, American Type Culture Collection (ATCC), 12301 
Parklawn Drive, Rockville, Maryland 20852 USA] and 
pDVcos134 [a gift from J. Reese, in wide circulation 
among scientists]. pWE15 DNA was digested with ClaI and 
SalI, and the 6 kb ClaI-SalI restriction fragment, lacking 
the cos sequence, was purified. Cosmid pDVcos134 was 
digested with ClaI and XhoI and a restriction fragment 
containing the duplicated cos region was purified on a 
low melting point agarose gel. The purified fragments 
were ligated using T4 DNA ligase and transformed into E. 
coli host strain DH5, which is a derivative of the 
strongly recA- strain DH1 (commercially available, e.g. 
from Bethesda Laboratories, Gaithersburg, MD, USA). 
Alternatively, purified fragments can be transformed into 
such host cells as AG1 (Stratagene Cloning Systems, San 
Diego, CA), a derivative of DH5 selected for high 
packaging efficiency, or into HB101 (commercially 
available, e.g. from Bethesda Laboratories, Gaithersburg, 
MD, USA). 

Other pWE plasmids suitable for genomic mapping 
according to the invention are disclosed in Evans et al., 
Methods in Enzymology, supra. Cosmid vector pWE16 has 
been deposited with the American Type Culture Collection, 
and has been accorded ATCC No. 37524. 

Cosmids sCOS-2 and sCOS-4 are derivatives of 
sCOS-1 where the cloning site has been altered to 
substitute other rare restriction sites for the NotI 
sites. Cosmid vector sCOS-2 was constructed by digesting 
sCOS-1 with EcoRI, and purifying the plasmid DNA away 
from the NotI-T3 promoter-BamHI-T7 promoter-NotI linker 
sequence by ethanol precipitation. A 30-nucleotide 
double-stranded synthetic oligomer with EcoRI coadhesive 
ends, containing NotI-T3 promoter-BamHI-T7 promoter-SacII 
sequences, was added by linker-tailing [Lathe et al., DNA 
2:173 (1984)]. sCOS-4 was constructed using a similar 
procedure, adding a double-stranded synthetic 
oligonucleotide containing EcoRI coadhesive ends and a 
SfiI-T3 promoter-BamHI-T7 promoter-SfiI sequence. 

EXAMPLE 1 

Construction of cosmid libraries in sCOS vectors 

High molecular weight genomic DNA for cosmid 
cloning was prepared by proteinase K digestion and gentle 
phenol extraction followed by dialysis [DiLella et al., 
Methods in Enzymology 152:199 (1987)]. The average 
molecular size of the isolated DNA was determined using 
field inversion gel electrophoresis [Carle et al., 
Science 232:65 (1986)] and ranged from about 500 kb to 
greater than 3 Mb. DNA was digested with MboI under 
conditions recommended by the manufacturers and the 
digestion terminated by phenol/chloroform extraction. 
Following digestion, the DNA was analyzed on field 
inversion gels or 0.3% agarose gels to determine the 
average size of the digestion products. For the 
construction of genomic libraries in cosmid vector sCOS-1, 
genomic DNA was digested to an average size of 100 - 120 
kb and dephosphorylated with calf intestinal 
phosphatase. The genomic DNA was not size separated 
before cloning. 

Vector cloning arms were prepared by first 
digesting purified sCOS vector DNA with XbaI followed by 
dephosphorylation with calf intestinal alkaline 
phosphatase. The reaction was terminated by 
phenol/chloroform extraction and the DNA collected by 
ethanol precipitation. The linearized, dephosphorylated 
vector DNA was then digested with BamHI, extracted with 
phenol/chloroform and stored at a concentration of 1 
mg/ml in 20 mM TRIS-HCl, pH 6, 1 mM EDTA. Ligations were 
performed using 1 µg of vector arms and 50 ng to 3 µg of 
genomic DNA. Reactions were incubated with 2 Weiss Units 
of T4 DNA ligase and packaged using commercial in vitro 
packaging lysates. Bacteriophage lambda packaging 
extracts may contain significant amounts of EcoK 
restriction activity. To avoid the possibility that 
mammalian sequences containing an EcoK site might be 
underrepresented in the library, genomic libraries are 
prepared using in vitro packaging extracts which lack 
EcoK restriction activity (e.g. Gigapak-Gold; Stratagene 
Cloning Systems, San Diego, CA). 

Cosmid libraries were plated directly on LB 
agar (LB media containing 1.2% Bacto-agar; autoclave) 
containing 25 µg/ml of kanamycin sulfate and libraries 
screened without further amplification [Evans et al., 
Methods in Enzymology 152:604 (1987)]. Libraries were 
stored as original non-amplified plate stocks in LB media 
(10 g Bacto-tryptone, 5 g yeast extract, 5 g NaCl per 
liter of water; autoclave) with 15% glycerol at a 
concentration of 2.2 x 10¹¹ bacteria/ml at -70 °C. The 
cosmid library used in the study described in the 
examples consisted of 1.5 x 10⁷ independent clones. 

Selection of Human Clones from a Somatic Cell 
Hybrid Genomic Library 

Cosmid libraries were plated on 570 cm² LB agar 
trays at a density of 10 clones/cm², replica filters 
prepared and filters hybridized with human placenta DNA 
labeled with ³²P-dCTP to a specific activity of 10⁸ 
cpm/µg. Under these hybridization conditions, no 
background hybridization was detected against cosmids 
carrying mouse genomic DNA. Cosmids containing human 
genomic DNA inserts were picked with toothpicks, 
rescreened by hybridization to ³²P-labeled human DNA, and 
archived in 96-well microtitre plates containing LB 
media, 15% glycerol and 25 µg/ml kanamycin sulfate at 
-70 °C. Individual clones isolated from cosmid libraries 
were routinely grown, replicated, and DNA prepared using 
standard round-bottom 96-well microtitre plates. Replica 
transfer of clones in 96-well microtitre plates and 
transfer from archived plates to screening filters was 
carried out using an aluminum "hedgehog" made from 3-mm 
diameter brass rods set in a plastic block, as described by 
Coulson et al., supra (p. 7822), or a laboratory robot 
(Beckman Biomek 1000). 

Plating and Screening Libraries 

For multiplex analysis, archived cosmids were 
inoculated on the surface of a nitrocellulose or nylon 
based filter in a matrix or "grid" pattern. The size and 
density of the "grid" was determined by the pattern of 
wells in a standard 96-well microtitre plate and, in the 
experiments described in the examples, a 36 x 36 matrix 
was used. Before applying bacterial culture, a matrix 
pattern prepared on paper was transferred directly to the 
filter membrane by passing the filter through a copying 
machine followed by autoclaving. The clones were allowed 
to grow on the surface of the filter at 37 °C for 12 to 15 
hours and bacterial DNA was fixed to the filter using a 
standard colony lysis procedure [Vogeli et al., Methods 
in Enzymology 152:407 (1987)]. 

RNA Probe Synthesis and Hybridization Reactions 

Cosmids were transferred from archives to fresh 
96-well plates containing liquid LB media with 25 µg/ml 
kanamycin sulfate and incubated at 37 °C in a humidified 
atmosphere for 6 to 10 hours. Supernatants from 
individual wells were pooled and DNA prepared using a 
previously described cosmid miniprep procedure [Evans et 
al., Methods in Enzymology, supra]. Cosmids constructed 
with vector sCOS-1, or one of its derivatives, yield up 
to 2 µg of DNA from a 300 µl culture. All probe 
synthesis and mapping reactions were carried out with DNA 
prepared from minilysates. In some cases, the pooled DNA 
was digested with a restriction endonuclease such as 
BamHI or HindIII prior to probe synthesis. RNA probes 
were synthesized using bacteriophage T3 or T7 polymerase 
(Stratagene Cloning Systems, San Diego, CA, USA). Thus, 
cosmid DNA was prepared and 1-2 µg of the DNA was 
transcribed with T7 or T3 RNA polymerase in a 20 µl 
reaction, as described by Melton et al. (1984) Nucleic 
Acids Research 12:7035-7054, using 50 µCi of [α-³²P]UTP 
and 12 µM unlabeled UTP. The ³²P-UTP polymerase reactions 
were terminated by extraction with phenol and chloroform. 
100 µl of blocking mixture (a mixture of sonicated human 
placenta DNA and cloned human repetitive sequences at a 
concentration of 1 mg/ml) was added, and the probe 
mixture was precipitated with ethanol. The nucleic acid 
was then resuspended in 20 µl of 5X saline sodium 
phosphate EDTA (SSPE), 0.1% sodium dodecyl sulfate (SDS), 
and prehybridized for 5 minutes at 42 °C to saturate 
repetitive sequences which might be present in the probe. 
The probe was then added to a plastic bag containing a 
replica of the matrix filter and hybridization buffer (5X 
SSPE, 50% formamide, 0.2% SDS, 1X Denhardt's solution 
(i.e., 0.2% Ficoll, 0.2% polyvinyl pyrrolidone, 0.2% 
bovine serum albumin; see Denhardt, Biochem. Biophys. 
Res. Commun. 23:641 (1966)), and 20 µg/ml salmon sperm 
DNA) and the hybridization reaction carried out for 12 to 
18 hours. Filters were washed in 0.1X SSPE, 0.1% SDS, at 
65 °C and exposed to X-ray film for 2 to 8 hours. 



WO 94/29486 



PCT/US94/06810 




End-specific Nucleotide Sequence Analysis 

Cosmids are transferred from archives to fresh 
96-well plates containing liquid LB media with 25 µg/ml 
kanamycin sulfate and incubated at 37 °C in a humidified 
atmosphere for 6 to 10 hours. DNA is prepared using a 
previously described cosmid miniprep procedure [Evans et 
al., Methods in Enzymology, supra]. DNA from each cosmid 
is sequenced in a commercially available DNA Sequencer 
following the manufacturer's instructions. 

Restriction Enzyme Analysis 

Restriction enzyme analysis of isolated cosmids 
was carried out using DNA isolated from minilysates. 
Cosmid DNA was prepared from minilysates as follows: 

DNA was isolated from 1.5 ml cultures. A 
culture was inoculated with a single bacterial colony and 
incubated with vigorous shaking at 37 °C for 6 hours. DNA 
was prepared using a modified boiling procedure [Evans et 
al., Methods in Enzymology 152:604 (1987)]. Cells were 
collected by a brief (1 min.) centrifugation in a 
microcentrifuge and cells were resuspended in 300 µl of 
STET buffer (50 µM TRIS-HCl, pH 8.0, 8% sucrose, 5% 
Triton X-100 and 50 mM EDTA). 20 µl of freshly prepared 
lysozyme (10 mg/ml) in STET buffer were added, the 
mixture vortexed and incubated in a boiling water bath 
for one minute. The solution was immediately centrifuged 
for 10 minutes in a microcentrifuge and the gelatinous 
pellet removed with a toothpick and discarded. 325 µl of 
isopropanol were added and the mixture incubated at room 
temperature for 5 minutes. The precipitated DNA was 
collected by centrifugation at room temperature in a 
microcentrifuge, the pellet dried and resuspended in 
water. 

DNA was digested to completion with NotI, 
digested partially with one or more enzymes (typically 
BamHI, EcoRI, HindIII, SacII, PvuII, and KpnI), separated 
on an agarose gel, transferred to a nitrocellulose filter 
and hybridized with ³²P-labeled oligonucleotides 
recognizing the T3 or T7 bacteriophage promoters. T3 and 
T7 oligonucleotides (commercially available as sequencing 
primers, Stratagene Cloning Systems, San Diego, CA, USA) 
were labeled using polynucleotide kinase and [γ-³²P]ATP to 
a specific activity of 2 x 10⁸ cpm/µg. The labeled 
oligonucleotides were then hybridized to the filters in 
6X saline sodium citrate (SSC), 10% Denhardt's solution 
for 12 hours at 42 °C and washed in 2X SSC for 10 minutes 
at 50 °C. Filters were exposed to X-ray film for 20 
minutes to 12 hours. The pattern of bands appearing on 
the autoradiograph could then be interpreted as 
indicating the distance from the cloning site to the 
restriction site, much as with the "cos"-mapping 
procedure of Rackwitz et al., Gene 30:195 (1984). 
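
The interpretation of the band pattern can be sketched as follows. Because the labeled oligonucleotide marks only the fragment end adjacent to the T3 (or T7) promoter, each band length in a partial digest is taken as the distance from the cloning site to one restriction site. The band sizes below are invented for illustration, and the function name is hypothetical.

```python
def sites_from_partial_digest(band_sizes_kb):
    """Convert end-labeled partial-digest band sizes into a site map.

    Each band size is taken as the distance (kb) from the probed end of
    the insert to one restriction site. Returns the sorted site positions
    and the sizes of the intervals between consecutive sites.
    """
    positions = sorted(set(band_sizes_kb))
    intervals = [round(b - a, 3) for a, b in zip([0.0] + positions, positions)]
    return positions, intervals


# Hypothetical band sizes (kb) read from an autoradiograph.
positions, intervals = sites_from_partial_digest([12.5, 3.1, 7.8, 21.0])
print("site positions (kb):", positions)
print("interval sizes (kb):", intervals)
```

Repeating the analysis with the probe for the opposite promoter would give distances measured from the other end of the insert, which offers a consistency check on the inferred map.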

Alternatively, programmed automatic restriction 
enzyme digestions were carried out to completion in 96- 
well microtitre plates using a laboratory robot (Beckman 
Biomek 1000). 

Data Analysis 

The resulting hybridization data were manually 
entered into a computer file and analyzed using computer 
programs written in Turbo Pascal (Borland International) 
running on Apple Macintosh II or Macintosh SE computers. 
One program, "Multiplex-mapper", compared data sets from 
hybridization reactions using mixed probes, determined 
those clones which were identified by more than one probe 
mixture, and produced a list of linked clones. A second 
program, "Contig-maker", assembled the list of overlapping 
clones into potential contigs which could be analyzed in 
greater detail. In some cases, orientation and overlap 
of individual cosmid clones in a contig were confirmed by 
detailed restriction mapping and hybridization analysis 
of the individual cosmid clones. 

Although data analysis was performed using the 
above-mentioned computer programs, a person of ordinary 
skill in the art should have no difficulty in carrying 
out the comparison of sequence and/or hybridization data 
and assembling the overlapping clones into contigs using 
other software. Moreover, manual data comparison and 
contig making are also possible, though more laborious. 
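
As one illustration of how other software might assemble linked clones into potential contigs, the sketch below groups clone pairs into connected components with a union-find structure. It is not the Contig-maker program, whose internals are not described here; the function name and the example coordinates are hypothetical.

```python
def assemble_contigs(linked_pairs):
    """Group linked clone pairs into candidate contigs (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for clone in list(parent):
        groups.setdefault(find(clone), set()).add(clone)
    # Largest candidate contigs first; ties broken by smallest member.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical (Y, X) coordinates of clones reported as linked.
pairs = [((1, 2), (2, 5)), ((2, 5), (9, 9)), ((4, 4), (4, 7))]
for contig in assemble_contigs(pairs):
    print(contig)
```

This grouping only states which clones belong together; it does not order the clones along the chromosome, which would still require the restriction mapping and hybridization analysis mentioned above.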



EXAMPLE 2 

Cosmid Multiplex Analysis of Human Chromosome 11q 

The cosmid vector sCOS-1 (Figure 1) was used to 
prepare a genomic library from a somatic cell hybrid 
containing as its only human material DNA from the distal 
long arm of human chromosome 11, including 11q21-11qter, 
in a mouse background [Maslen et al., Genomics 2:66 
(1988)]. The distal long arm of human chromosome 11 is 
of biological interest for a number of reasons. Like the 
major histocompatibility complex, the T cell receptor and 
immunoglobulin genes, and the IgK-CD8A-CD8B region of 
chromosome 2p12, human chromosome 11q23 contains a 
cluster of genes encoding proteins which are members of 
the immunoglobulin superfamily and are possibly important 
for cell-cell interactions in the immune and nervous 
systems including Thy-1, CD3 δ, epsilon, and N-CAM 
[Nguyen et al., J. Cell. Biol. 102:711 (1986)]. 11q23 is 
the location of genes in which defects may be responsible 
for ataxia telangiectasia [Gatti et al., Nature 336:577 
(1988)] and other hereditary disorders including multiple 
endocrine neoplasia type I [Larsson et al., Nature 332:85 
(1988)], diabetes analogous to the NOD mouse [Prochazka 
et al., Science 237:286 (1987)] and others are also 
likely linked to markers on leukemias and pathognomonic 
for Ewing's sarcoma, peripheral neuroepithelioma and 
Askin's tumor [Griffin et al., Proc. Natl. Acad. Sci. USA 
82:6122 (1986)]. The initial physical analysis of human 
chromosome 11 should allow eventual analysis of the genes 
associated with these phenomena and the underlying 
biology. 

A genomic library consisting of 1.2 x 10 7 

10 individual members was prepared and cosmids containing 
only human DNA were selected from this library by 
screening with probes recognizing human, repetitive 
sequences. The proportion of human clones in this 
genomic library was 0.9%, indicating that the proportion 

15 of human chromosome 11 present in the somatic cell hybrid 
was about 27 mb, consistent with previous cytogenic and 
molecular characterization of this cell line [Maslen et 
al., Supra 1 . 1296 clones were selected, archived in 
96-well microtitre plates, and arranged on a 

20 nitrocellulose filter according to the columns and rows 
of a 36 x 36 matrix. Using probes recognizing many 
available DNA markers mapping to this chromosome, 
consmids containing the genes THY1 [van Rijs et al., 
Proc. Natl. Acad. Sci. USA 2:5832 (1985)], T3D, T3E 

25 [Evans et al., Immunogen 28:365 (1988)], ETS1 [Watson et 
al., Proc. Natl. Acad. Sci. USA 83.: 1792 (1986)], PBG 
[Wang et al. , Proc. Natl. Acad. Sci. USA 78:5734 (1981)], 
PGR [Hisrahi et al., Biochem. Biophys. Res. Commun. 
143:740 (1987)], SRPR [Lauffer et al., Nature 318:334 

30 (1985)], and AP0A1 [Karathanasis et al., Proc- Natl. 

Acad. Sci. USA 80:6147 (1983)]. The identified genes and 
clone coordinates for DNA markers on human chromosome 
llq-llqter represented in the ordered cosmid set are 
shown in Table 2. 

Table 2 

Identified genes and clone coordinates for DNA markers on 
human chromosome 11q21 - 11qter represented in the 
ordered cosmid set. 

cosmid clone            marker 
coordinate (Y,X) 

3,11                    THY1 
11,34                   THY1 
10,7                    T3D (CD3δ,ε) 
7,13                    ETS1 
5,13                    ETS1 
4,22                    PGR 
13,27                   APOA1 
11,25                   PBGD 
12,19                   D11S23 
24,12                   D11S24 
24,8                    SRPR 
11,31                   SRPR 

Additional available RFLP markers [Maslen et 
al., supra] were also identified in this collection to 
allow eventual correlation of the emerging physical map 
of chromosome 11 with the linkage map. 

Groups of clones corresponding to 32 of the 
rows and 36 columns were pooled, and 68 hybridization 
reactions were carried out on replica filters according 
to the strategy outlined hereinbefore. Mixed probes 
detected a minimum of nine and a maximum of 46 
cross-hybridizing unique clones on the filter matrix with 
each hybridization reaction using a pooled probe. When 
hybridization is carried out with a mixed probe 
consisting of RNA transcripts from cosmids of a row of 
the matrix, and a mixed probe representing a pool of all 
cosmids aligned along a column of the matrix, the cosmid 
clone which hybridizes with both mixed probes is linked 
to the clone located at the intersection of the row and 
column from which probe mixtures were prepared. To aid 
in the analysis of the data generated by this procedure, 
the Y and X coordinates of the cross-hybridizing clones 
are entered into a computer and matches identified using 
one of several computer programs. From this series of 
experiments, 1099 linked clones were detected from the 
hybridization of 36 pooled columns and 32 pooled rows of 
the matrix. Several of these overlapping clones were 
analyzed by restriction mapping to confirm that the 
clones did indeed overlap in the expected manner. 
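
The row and column intersection logic described above can be sketched as follows. This is only an illustration and not one of the computer programs referred to in the text; the function name, the dictionary layout of the hit lists and the rule for discarding self-signals are assumptions made for the example.

```python
from collections import defaultdict


def find_links(row_hits, col_hits):
    """Infer clone links from pooled row and column probes.

    row_hits[r] is the set of (y, x) filter positions that hybridize with
    the mixed probe prepared from row r; col_hits[c] is the same for the
    mixed probe prepared from column c. A position that hybridizes with
    both probes is taken to be linked to the clone at (r, c).
    """
    links = defaultdict(set)
    for r, r_positions in row_hits.items():
        for c, c_positions in col_hits.items():
            for (y, x) in r_positions & c_positions:
                # Assumption: a clone lying in the probed row or column
                # lights up with its own transcript, so it is skipped.
                if y == r or x == c:
                    continue
                links[(y, x)].add((r, c))
    return dict(links)


if __name__ == "__main__":
    rows = {3: {(10, 7), (20, 5)}}
    cols = {11: {(10, 7), (8, 8)}}
    print(find_links(rows, cols))
```

In this sketch a clone can be linked to more than one source coordinate, which is the raw material for the contig assembly discussed in the next section.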

Completeness of the Cosmid Multiplex Data 

From the list of linked clones produced by this 
multiplex technique, contigs were assembled either 
manually or through computer analysis of the data from 
the predicted hybridization linkage using mixed multiple 
RNA probes. Based on an initial analysis of the data 
using a simple algorithm for contig construction, 315 
contigs were assembled from the 1099 linked clones 
determined from multiplex analysis. The size of the 
contigs ranged from 2 linked cosmids to 27 cosmids 
grouped into a contig extending over several hundred kb, 
with the majority of contigs consisting of between 2 and 
5 cosmids. To confirm that these groupings reflected the 
true structure of the human chromosome, and not 
artifactual groupings due to random cross-hybridization, 
several of the contigs were restriction mapped in detail 
to determine the degree of overlap and establish a 
physical map. The restriction map of a representative 
contig assembled by this strategy is shown in Figure 5. 
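
The text does not specify the "simple algorithm for contig construction". One plausible reading, given here purely as an assumed sketch, is to treat each linked pair as an edge and to report the connected groups of clones, which can be done with a union-find structure. The function name and the representation of clones as (Y, X) tuples are hypothetical.

```python
def build_contigs(linked_pairs):
    """Group clones into contigs from a list of linked clone pairs.

    Each pair is treated as an undirected edge; contigs are the connected
    components, found here with a small union-find structure.
    """
    parent = {}

    def find(clone):
        parent.setdefault(clone, clone)
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path halving
            clone = parent[clone]
        return clone

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for clone in list(parent):
        groups.setdefault(find(clone), set()).add(clone)
    # Largest contigs first; members sorted for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


if __name__ == "__main__":
    pairs = [((3, 11), (10, 7)), ((10, 7), (5, 13)), ((24, 8), (24, 12))]
    for contig in build_contigs(pairs):
        print(len(contig), contig)
```

Grouping by connectivity alone does not order the clones within a contig; ordering still relies on restriction mapping as described above.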

Assessment of Progress 

Based on the assumption that the region of 
human chromosome 11 carried by the parent hybrid 
represents about 27 mb, the collection of 1296 cosmid 
clones analyzed here represents about 2 genome 
equivalents. It is also estimated that the minimal 
detectable overlap by hybridization analysis using 
end-specific RNA probes is about 200 nucleotides. If θ is 
the fraction of length of two clones which must be shared 
in order for overlap to be detected [Lander et al., 
supra], then the expected number of contigs consisting of 
at least two clones generated by the analysis of N cosmid 
clones is 

$$N e^{-c(1-\theta)} - N e^{-2c(1-\theta)}$$

wherein the redundancy of coverage is c = LN/G, where L 
is the length of the clone insert and G is the haploid 
genome length in base pairs. 

The minimum detectable overlap with 
end-specific RNA probes is θ = 0.005. Approaching the 
theoretical limit of θ → 0, a maximum of about 450 contigs 
would be expected to result after the analysis of one 
genome equivalent and about 260 contigs after the 
analysis of 2 genome equivalents. Thus the analysis of 
the clone set carried out here, generating 315 contigs 
after the analysis of about two genome equivalents, is in 
good agreement with theoretical predictions. The main 
advantage of the current strategy is that the analysis of 
1296 clones required only 72 analytical reactions, rather 
than 1296. 
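
The expression for the expected number of contigs can be evaluated directly. The short sketch below does so; the function and parameter names are hypothetical, and the clone count, insert length and genome length used in the usage lines are illustrative assumptions rather than values taken from this Example.

```python
import math


def expected_contigs(n_clones, insert_len, genome_len, theta):
    """Expected number of contigs containing at least two clones.

    Evaluates N*exp(-c*(1 - theta)) - N*exp(-2*c*(1 - theta)) with
    redundancy of coverage c = L*N/G.
    """
    c = insert_len * n_clones / genome_len
    sigma = 1.0 - theta
    return n_clones * (math.exp(-c * sigma) - math.exp(-2.0 * c * sigma))


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not from the Example).
    for n in (500, 1000, 2000):
        value = expected_contigs(n, insert_len=40_000,
                                 genome_len=40_000_000, theta=0.005)
        print(n, round(value, 1))
```

The expression rises and then falls as coverage increases, which reflects the merging of contigs as gaps close.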

It was found that the prehybridization of the 
RNA probes with a high concentration of human repetitive 
sequences, as hereinabove described, was sufficient to 
completely block hybridization of most of these 
sequences, and was sufficient for eliminating most of 
these artifactual linkages. However, the analysis of 
several large contigs mapping to human chromosome 11 
generated by this analysis has revealed several cosmid 
clones which were included in a contig but whose 
inclusion could not be substantiated based on the result 
of restriction mapping and hybridization analysis. This 
artifact may be the result of cryptic low-frequency 
repetitive or redundant sequences present in this region 
of the genome, or could be the result of genomic 
sequences which are unstable and are deleted or 
rearranged when cloned in E. coli. Evidence for the 
latter sequences, isolated through screening 
non-amplified cosmid libraries, has been found in the 
analysis of the human CD3 locus [Evans et al., 
Immunogen, supra]. However, it should be noted that the 
multiplex technique of the present invention, when 
carried to completion using both T3 and T7 mixed RNA 
probes, generates data that is internally redundant in 
that both members of a linked pair should cross-hybridize 
with one another. Thus, further refinement of this 
approach should eliminate most serious artifacts arising 
during multiplex clone analysis. 
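
The internal redundancy noted above suggests a simple consistency filter: a link is retained only if it is observed in both directions. The following is an assumed sketch of such a filter, not a procedure disclosed in the text; links are represented as hypothetical (probe clone, detected clone) pairs.

```python
def reciprocal_links(t3_links, t7_links):
    """Keep only clone pairs that cross-hybridize in both directions.

    Each input is an iterable of (probe_clone, detected_clone) pairs
    obtained with T3 or T7 mixed probes. A pair {a, b} is kept only when
    both (a, b) and (b, a) were observed in the combined data.
    """
    directed = set(t3_links) | set(t7_links)
    return {frozenset((a, b)) for a, b in directed
            if a != b and (b, a) in directed}


if __name__ == "__main__":
    t3 = [("A", "B"), ("C", "D")]
    t7 = [("B", "A")]
    print(reciprocal_links(t3, t7))
```

Links seen in one direction only, such as the C to D link in the usage lines, would be flagged for confirmation by restriction mapping rather than discarded outright.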

In this regard, the analysis disclosed in the 
present invention has generated a partially overlapping 
cosmid set which is estimated to include about 60% of the 
11q21-11qter region of human chromosome 11q. The results 
of certain preliminary restriction enzyme analyses, 
further analysis of contigs and filling-in by traditional 
chromosome walking are in complete agreement with 
theoretical calculations of fingerprinting efficiency. A 
more complete analysis of this and other chromosome 
regions using a number of cosmids for 4 or more genome 
equivalents would be expected to result in near closure 
of the map. Using the technique of the present 
invention, this would require a collection of about 3600 
cosmids and 120 T3 or T7 reactions/hybridizations rather 
than the 72 carried out in the present Example. In 
addition, the technique of the present invention is 
applicable for traditional chromosome "walking" to allow 
"filling-in" of gaps in a near complete map. 
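
The reaction counts quoted above follow from arranging the clones in a square matrix and pooling each row and each column once. A minimal sketch of that arithmetic, assuming a square layout and using a hypothetical function name, is:

```python
import math


def pooled_reactions(n_clones):
    """Number of pooled probe reactions for a square clone matrix.

    Assumes the clones fill an n x n matrix exactly, with one pooled
    reaction per row and one per column (2n in total).
    """
    side = math.isqrt(n_clones)
    if side * side != n_clones:
        raise ValueError("this sketch assumes a square matrix")
    return 2 * side


if __name__ == "__main__":
    print(pooled_reactions(1296))  # 36 x 36 matrix -> 72
    print(pooled_reactions(3600))  # 60 x 60 matrix -> 120
```

The number of reactions therefore grows with the square root of the number of clones rather than linearly.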

Additional analysis of this cosmid set 
representing chromosome 11q can be completed by automated 
restriction mapping. Analysis to date has revealed the 
presence of 177 potential "linking" clones, containing 
one or more NotI restriction sites, and 77 clones 
containing SacII sites indicative of hypomethylated 
CpG-rich islands. Forty of these cosmid clones contain 
clustered rare CpG-rich restriction sites and can be 
identified unequivocally as hypomethylated islands. In 
addition, cosmid clones have recently proved very useful 
for in situ hybridization to metaphase or interphase 
chromosomes [Lichter et al., Proc. Natl. Acad. Sci. USA 
85:9664 (1988)] and the identification of the cytogenetic 
location of single-copy DNA sequences. These procedures 
potentially will allow ordering of cosmid contigs with 
resolution of greater than 500 kb and, coupled with the 
strategy described here, provide a powerful mechanism for 
the construction of physical maps of chromosomes. 

Still further analysis of this cosmid set 
representing chromosome 11q can be carried out by direct 
automated DNA sequencing of the fragment ends as 
described above in the Detailed Description and in 
Examples 3 and 4. 

EXAMPLE 3 

Sequence-Sampled Map of G. lamblia genome 

A 10.5 mb genome of Giardia lamblia (Fan et 
al., 1991, Nucleic Acids Res., 19:1905-1908) was cloned as 
a twenty-genome equivalent cosmid library. Five thousand 
cosmids can be mapped, end-oriented and end-sequenced, 
generating 10000 ordered sequence fragments spaced, on 
average, every one kilobase. The determination of 500 bp 
of DNA sequence directly from each cosmid end results in 
an average spacing between islands of 0.5 kb. 
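
The spacing figures in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below assumes that the cosmid ends are spread evenly over the genome, which is a simplification, and the function name is hypothetical.

```python
def sampling_stats(genome_bp, n_cosmids, read_bp):
    """Return (number of end sequences, mean spacing, mean gap) in bp.

    Assumes two end reads per cosmid and an even spread of cosmid ends
    over the genome; the gap is the unsequenced stretch between reads.
    """
    n_ends = 2 * n_cosmids
    spacing = genome_bp / n_ends
    gap = max(spacing - read_bp, 0.0)
    return n_ends, spacing, gap


if __name__ == "__main__":
    ends, spacing, gap = sampling_stats(10_500_000, 5000, 500)
    print(ends, spacing, gap)  # 10000 1050.0 550.0
```

With the values given in the text this yields a spacing of about one kilobase between fragments and a gap of roughly 0.5 kb between sequenced islands.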

Cosmid clones 

A Giardia lamblia cosmid library was 
constructed in vector sCos-1. The library was prepared 
by partially digesting WB strain Giardia lamblia genomic 
DNA with Sau3A and cloning the digestion products into 
the BamHI site. From primary platings, 6,250 clones, 
representing about 20X coverage, were picked, 
individually archived in 96-well microtitre plates and 
stored as frozen glycerol stocks. Clone names are 
derived from the library designation (GR) followed by the 
plate number and well position (e.g., cGR-2b11 denotes a 
cosmid located on plate 2 in well b11). These were 
arrayed onto filters with a Biomek 1000 robot (Beckman 
Instruments, Fullerton, CA) and the DNA fixed onto the 
filters using well known methods for analysis by 
hybridization. A specific hybridization probe to the 
β-giardin region was used to detect sixteen overlapping 
cosmids and a random probe detected another 26 
overlapping cosmids. These cosmids were subjected to a 
series of mapping and sequencing experiments. 

Automated DNA sample preparation 

Template cosmid DNA was prepared by an alkaline 
lysis procedure (Sambrook et al., supra), or by using one 
of several DNA prep robots. Automated procedures used 
DNA prepared by an Autogen 540 DNA preparation robot 
(cycle 411), which was subsequently digested for one hour 
with RNAse A (75 µg/µl) in a total volume of 23 µl, and 
then precipitated with ethanol. Overnight growth of 
cosmid clones in 5 ml of terrific broth (see, Sambrook 
et al., supra) with 10 µg/ml kanamycin generally 
provided sufficient DNA for one or two sequencing 
reactions. 

Automated DNA sequencing 

DNA sequencing was carried out using primers 
complementary to the T3 or T7 polymerase promoter located 
in the cosmid vector flanking the insertion site. 
Template cosmid DNA was prepared by an Autogen 540 DNA 
preparation robot (cycle 411). Automated sequencing 
reactions were carried out using dye labeled T3 or T7 
oligonucleotide primers with reactions assembled and 
cycle sequenced using the Applied Biosystems (ABI, Foster 
City, CA) Catalyst 800 robot with DNA concentrations of 
~0.2 µg/µl. Sequence determination was carried out using 
the ABI 373A fluorescent sequencer. The labels (names) 
given to each of the DNA sequences determined from the 
cosmid clones are the clone names followed by -t or -u, 
denoting sequences from the T3 or T7 priming sites, 
respectively (e.g., cGR-19a9-u is the sequence of the T7 
end of the cosmid clone found on plate 19 in position A9 
in the GR library). The increased organization permitted 
by the new nomenclature system used in Examples 3 and 4 
will facilitate the large scale physical mapping 
strategies described herein. 
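
Because the names follow a fixed pattern, they can be decoded mechanically. The parser below is a hypothetical sketch: the regular expression, the assumption that rows are the lowercase letters a to h of a 96-well plate, and the field names are all choices made for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: 'c' + library + '-' + plate + row letter + column,
# optionally followed by '-t' (T3 end) or '-u' (T7 end).
NAME_RE = re.compile(
    r"c(?P<lib>[A-Za-z0-9]+)-(?P<plate>\d+)(?P<row>[a-h])(?P<col>\d{1,2})"
    r"(?:-(?P<end>[tu]))?"
)


class CloneName(NamedTuple):
    library: str
    plate: int
    row: str
    column: int
    end: Optional[str]  # "T3", "T7", or None for a bare clone name


def parse_name(name):
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized clone or sequence name: {name!r}")
    end = {"t": "T3", "u": "T7", None: None}[match["end"]]
    return CloneName(match["lib"], int(match["plate"]), match["row"],
                     int(match["col"]), end)


if __name__ == "__main__":
    print(parse_name("cGR-19a9-u"))
    print(parse_name("cGR-2b11"))
```

A structured form of this kind is convenient when sequence names are used as keys in the relational database mentioned later in this Example.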

Contig construction and physical mapping 

Identification of cosmids from the β-giardin 
genomic region was accomplished using filters 
representing five-fold of the twenty-fold genomic 
library. Bacterial clones were stamped into a grid 
pattern with a Biomek 1000 robot using S & S Nytran 
filters with a 0.4 µm pore size and fixed to the filters 
using standard fixation procedures. A probe of 
approximately 1 kb recognizing β-giardin genomic DNA was 
generated by PCR with the oligonucleotides 
GGTCAAGCTCAGCAACATGA (SEQ ID NO: 1) and 
TGCTTTGTGACCATCGAGAG (SEQ ID NO: 2) with standard 
amplification conditions and an annealing temperature of 
60 °C. Similarly, a random probe was generated as a 1.5 
kb product with the primers CAGCAGATGGTCAAGCAAAA (SEQ ID 
NO: 3) and ACTCCTGACACCACCACCTC (SEQ ID NO: 4). 

Physical and sequence maps were constructed by 
full digestion of each cosmid with the restriction 
enzymes NcoI and BglII. Restriction fragments were 
separated on a 0.4% high strength agarose gel (0.5X TBE; 
see Sambrook et al., supra) run for 24 hours at 22 V in a 
20 by 20 cm gel apparatus. The NcoI and BglII 
restriction enzymes digest the vector opposite the 
cloning site generating genomic fragments. Each end of 
the genomic insert was detected as a vector/genomic 
chimera by hybridization with probes flanking the T3 and 
T7 promoter sites of sCos-1. The 1046 bp T3 probe was 
amplified from sCos-1 with the primers (5' to 3') 
TCGCTCACTGACTCGCTG (SEQ ID NO: 5) and AGCCCTCCCGTATCGTAGTT 
(SEQ ID NO: 6), and the 1004 bp T7 probe with the primers 
CTTGAGAGCCTTCAACCCAG (SEQ ID NO: 7) and AACTGGGCGGAGTTAGGG 
(SEQ ID NO: 8) with an annealing temperature of 60 °C and 
the standard conditions described above. The T7 probe 
was labeled by random priming with 35S dATP and the T3 
probe with 32P dATP for dual-label hybridizations. Maps 
were constructed by determining an order of fragments 
with no gaps with the GRAM program [see, e.g., Soderlund 
et al., in Proceedings of the 26th Hawaii International 
Conference on System Sciences: Biotechnology Computing 
(ed. Hunter, L.) 620-630 (CA: IEEE Computer Society 
Press, 1993)]. 

The mapping strategy consisted of digesting 
each cosmid with two different restriction enzymes and 
precisely determining the number and sizes of the various 
products. The enzymes NcoI and BglII were chosen for 
mapping because their recognition sites have different 
numbers of G/C bases and their sites are located several 
kb from the vector cloning site. Fragment sizes were 
estimated with the GelReader program of T. Redman 
(National Center for Supercomputing Applications, Urbana, 
Illinois) and maps constructed with a modified version of 
GRAM, which takes the end fragments into account. Maps 
of oriented cosmids were generated by comparison of the 
two maps of restriction fragments (Figure 6). The ends 
were placed by fit relative to the neighboring sites for 
32 of the 42 cosmids while the remaining ten were 
determined by the concordance of the possible 
orientations of ends on both enzyme maps using 
neighboring known cosmid ends as anchors. One cosmid, 
11h4, was equivocal, and the FASTA analysis of the 
derived sequences showed that its T7 end matched the 
reverse complement of the 25h9 T3 sequence. 

Sequence Analysis 

The presence of repetitive sequences was 
determined using the program FASTA, which compares all 
sequences to those previously determined, supplemented 
with a comprehensive set of di- and tri-nucleotide 
repeats. A FASTA cutoff score of 100 was used to 
distinguish repetitive sequences from background random 
matches. Similarities to known genes were identified 
with the BLAST program and the GenBank database. Amino 
acid comparisons were performed by translating DNA 
sequence fragments into all six potential reading frames 
and comparing translations to protein sequences in the 
non-redundant Swiss-Prot, GenPept or PIR database of the 
National Center for Biotechnology Information (Bethesda, 
MD) using the program BLASTX. The results of these 
various searches were evaluated numerically and by 
inspection (see, e.g., Table 3). The data obtained as 
described herein, including DNA sequence file pointers, 
matches from sequence analysis and information about 
overlapping clones, were stored in a relational database. 
The sequences generated by automated fluorescent 
sequencing from cosmid ends have been deposited with 
GenBank and were not edited to remove unidentified bases 
or correct the sequence. 

Sequences derived from cosmid ends were 
analyzed for the presence of repetitive sequences, simple 
sequence repeats and similarities to known genes (Figure 
7). A number of genes were detected during the 
characterization of sequences as described herein. There 
were three protein kinases and a surface antigen gene in 
the random genetic region; no additional genes were 
detected near the β-giardin gene. Additional genes could 
likely be determined with gene prediction programs 
developed for these applications such as GRAIL. 
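
The six-frame translation step used for the amino acid comparisons described above can be illustrated with a short sketch. It is not the BLASTX implementation; it assumes the standard genetic code, writes any codon containing an unidentified base as X, and uses hypothetical function names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (assumed for this illustration).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one frame; codons with unknown bases become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frames("ATGGCCNNNTAA").items()):
        print(name, protein)
```

Because the deposited sequences were not edited, unidentified bases are expected, and a translation routine of this kind has to tolerate them rather than fail.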

Computer-aided primer pair design 

One advantage of mapping and sequence sampling 
is future independence from clone libraries through PCR 
amplification. This advantage was demonstrated within 
the random region contig. The selection of primer pairs 
for PCR analysis was carried out using the PRIMER program 
provided by E. Lander (MIT) for each of the 
cosmid-derived end-sequences that were determined. 
Analysis was done in batch processing mode on a Sun 
workstation specifying an annealing temperature of 60 °C 
and a primer length of 18 - 22 nt. Modifications of these 
parameters (oligonucleotide length 25 nt and annealing 
temperatures of 55 °C for AT-rich sequences and annealing 
temperatures to 65 °C or greater for GC-rich sequences) 
generally allowed production of a suitable primer. 
Primers were produced commercially by Genset, Inc. 
(Paris, France). 

Predicted STS primers were tested by PCR 
amplification in a 30 µl reaction volume containing: 10 
mM Tris (pH 8.8), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 
200 µM each dNTP, 100 ng genomic DNA and 1.5 units of Taq 
DNA polymerase. Initial PCR conditions were: 

denaturation at 93 °C for 2 minutes; 
35 cycles of 30 seconds each at 94 °C; 
annealing for one minute at the predicted 
annealing temperature, and 
30 seconds at 72 °C; followed by 
a final extension at 72 °C for five minutes. 

The probes were labeled by random priming with α-32P dCTP 
and sCos-1 DNA was labeled with 35S dATP to identify 
specific clones in the hybridization process. 

The ordered sequences were then checked by PCR 
amplification, demonstrating amplification of a specific 
region of interest independent of the cosmid libraries. 
Products were obtained for most (>85%) of the neighboring 
sequences. Some failures, possibly due to incorrect 
primer sequences and large distances between primers, 
were not analyzed further within the random contig region. 

In summary, two regions of the Giardia lamblia 
genome were mapped and sequences sampled from within 
those areas of approximately 160 kb (Figures 6A and 6B). 
The sequenced cosmid ends encompass approximately 15% of 
this total region. The average sequence read was 347 bp, 
which is close to the median (353 bp), with a 
standard deviation of 39 bp. Eight cases of cosmids 
ending at exactly the same Sau3A sites were observed and 
in four cases cosmid ends were close enough to one 
another to form small sequence contigs. Overall, the 
median spacing between cosmid ends was 1.25 kb with an 
average spacing of 2.0 kb, suggesting that the relatively 
rare events of cloning from defined restriction sites 
follow a Poisson process, as also suggested by visual 
inspection of Figure 8. These are underestimates of 
sequence overlaps and identical ends since cosmids which 
intrude into the contig without encompassing the region 
detected with the hybridization probe were not included 
in the contig. Full characterization of the region would 
nearly double the amount sequenced to approximately 30%. 
The large number of identical ends compared to overlapping 
sequence contigs suggests that the practical limit for 
constructing a library with one restriction enzyme for 
partial digests is below twenty fold and perhaps as low 
as five to ten fold in the sequence sampling strategy. 

EXAMPLE 4 

Preparation of a Chromosome-11 Sequence Sampled Map 

Cosmid libraries 

Two chromosome 11-specific cosmid libraries 
were constructed in vector sCos-1. Cosmids denoted 11q 
were isolated from a library prepared from hybrid TG5D1-1 
representing chromosome 11q13-11qter (Evans and Lewis, 
1989, Proc. Natl. Acad. Sci. USA, 86:5030-5034). 
TG5D1-1 is a Friend cell line derived from somatic cell 
hybrid 5D1 that carried an intact human X chromosome and 
chromosome 11 [Pyati et al., Proc. Natl. Acad. Sci. USA 
77:3435 (1980)], and was selected for the loss of the 
entire X chromosome and most of chromosome 11. TG5D1-1 
contains the distal portion of chromosome 11 as the only 
human material in a mouse genomic background [Maslen et 
al., Genomics 2:66 (1988)]. Cytogenetic and molecular 
analysis indicates that the amount of human DNA 
represented about 1% of the mouse genomic background 
[Maslen et al., supra]. 

Cosmids denoted SRL were isolated from a flow 
sorted chromosome 11-specific library prepared from 
somatic cell hybrid J1 (described in Kao et al., 1977, 
Somatic Cell Genet., 3:421-429). The SRL library was 
prepared from 100 ng of flow purified chromosome 11 at 
the Los Alamos National Laboratory (L. Deaven and J. 
Longmire) as part of the National Gene Library Project 
and represented 125X coverage of chromosome 11. 
Approximately 17,000 clones, representing about 5X 
coverage, were picked from primary platings and 
individually archived in 96-well microtitre plates for 
use in this example. Characterization of this library 
revealed that about 4% consist of non-human inserts. 
Clone names are derived from the library designation 
followed by the plate number and well position (e.g., 
c11q-2b11 denotes a cosmid from the c11q cosmid library 
located on plate 2 in well b11). 

Cosmid end sequencing 

To generate sequence data from the cosmids, DNA 
sequencing was carried out using primers complementary to 
the T3 or T7 polymerase promoter located in the cosmid 
vector flanking the insertion site. Template cosmid DNA 
was prepared by an alkaline lysis procedure (Sambrook et 
al., supra), or by using one of several DNA prep robots. 
Automated procedures used DNA prepared by an Autogen 540 
DNA preparation robot (cycle 411), which was subsequently 
digested for one hour with RNAse A (75 µg/µl) in a total 
volume of 23 µl, and then precipitated with ethanol. 
Overnight growth of cosmid clones in 5 ml of terrific 
broth (Sambrook et al., supra) with 10 µg/ml kanamycin 
generally provided sufficient DNA for one or two 
sequencing reactions. Alternatively, DNA template 
preparations were done using the custom "Prepper, Ph.D." 
DNA preparation robot developed by H. R. Garner (Garner 
et al., Sci. Computing and Automation, March 1993, 
61-68). Some of the initial sequences in this example were 
determined manually using Sequenase kits with dideoxy 
terminators (U. S. Biochemicals). 

Automated sequencing reactions were carried out 
using dye labeled T3 or T7 oligonucleotide primers with 
reactions assembled and cycle sequenced using the Applied 
Biosystems (ABI) Catalyst 800 robot with DNA 
concentrations of ~0.2 µg/µl. Automated DNA sequencing 
was carried out using the ABI 373A sequencer. The names 
of DNA sequences determined from the cosmid clones are 
the clone names followed by -t or -u, denoting sequences 
from the T3 or T7 priming sites, respectively (e.g., 
c11q-2b11-u is the sequence of the T7 end of cosmid 2b11 
from library 11q). 

A sample of 371 DNA sequence fragments was 
determined by automated sequencing from cosmid ends. 
Forty-nine sequences were not used for STS production in 
this example and were placed in reserve should additional 
STSs be needed. Extensive regions of repetitive sequence, 
which made PCR primer prediction unlikely, occurred in 14% 
(45/322) of those analyzed (Table 4). The remaining 277 
sequences (86%) were used for STS primer prediction by 
computer analysis, and oligonucleotide primers were 
synthesized and tested. 
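
The counts in this paragraph are mutually consistent, as the following tally shows. The function and key names are hypothetical; the three input numbers are those stated in the text.

```python
def sts_input_tally(total, reserved, repetitive):
    """Tally end sequences available for STS primer prediction.

    total: end sequences determined; reserved: sequences held back;
    repetitive: analyzed sequences rejected for extensive repeats.
    """
    analyzed = total - reserved
    usable = analyzed - repetitive
    return {
        "analyzed": analyzed,
        "usable": usable,
        "repetitive_pct": round(100 * repetitive / analyzed),
        "usable_pct": round(100 * usable / analyzed),
    }


if __name__ == "__main__":
    print(sts_input_tally(total=371, reserved=49, repetitive=45))
```

With 371 sequences, 49 in reserve and 45 rejected as repetitive, the tally gives 322 analyzed and 277 usable sequences, that is, 14% and 86% of those analyzed.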






WO 94/29486 PCT/US94/06810 



[Table 4] 


The sequences generated by automated 
fluorescent sequencing from cosmid ends have been 
deposited with GenBank and were not edited to remove 
unidentified bases or correct the sequence. The sequence 
information was not corrected before analysis so that the 
Primer program would not predict primers in regions of 
questionable accuracy. Consequently, some of the 
alignments showing similarity between known protein 
sequences and translated chromosome 11 specific sequences 
contain Xs at unknown and stop codons which could be due 
to errors in the sequence or translating beyond exons 
into introns. 

Computer-aided primer pair design 

The selection of primer pairs for STS analysis 
was carried out using the Primer program available from 
E. Lander (MIT). Analysis was done in batch processing 
mode on a Sun workstation specifying an annealing 
temperature of 60 °C and a primer length of 18 - 22 nt. 
Modifications of these parameters (oligonucleotide length 
25 nt and annealing temperatures of 55 °C for AT-rich 
sequences and annealing temperatures to 65 °C or greater 
for GC-rich sequences) generally allowed production of a 
suitable primer set. DNA sequences that contained 
extensive regions of repetitive sequence so that primers 
could not be designed to generate products of at least 
120 bp were not utilized. Primers were synthesized using 
an ABI PCR-Mate oligonucleotide synthesizer or produced 
commercially by Genset, Inc. (Paris, France). 

Of this collection of primers, 77% (212/277) 
generated PCR amplification products from human genomic 
DNA and J1 hybrid DNA without producing products from 
yeast genomic DNA and are suitable for STS content 
mapping using YAC libraries. Localization of 88% of 
these new STS markers (188/212) was carried out by hybrid 
analysis, FISH or a combination of both. For comparison, 
nearly the same success rate occurred in predicting 
primer sets from sequences retrieved from GenBank. In 
some cases primer sets derived from the cSRL library 
which failed to amplify human DNA clearly generated a 
product from hamster, the J1 cell line host (Table 5). 

STS markers generated from mapped cosmids were 
supplemented with STSs corresponding to almost all of the 
cloned genes on chromosome 11 to allow for eventual 
production of a complete chromosome map. Genes mapped to 
chromosome 11 were identified by searching the Genome 
Data Base (GDB; Pearson et al., 1992, Nucleic Acids Res., 
20:2201-2206), On-line Mendelian Inheritance in Man 
(OMIM), GenBank, Medline, and the like. The DNA 
sequences were retrieved from GenBank when available and 
STS primers designed using PRIMER. Whenever possible, 
chromosome 11 genomic DNA sequence was used for STS 
production and primer pairs were designed to generate 
products of 400 to 1000 base pairs. When genomic DNA 
sequence was not available, the generally intron-free 3' 
untranslated region of chromosome 11 cDNAs was chosen. 
Where these were not available, cDNA coding regions were 
utilized and the predicted amplification product sizes 
were limited to 150-250 bp in order to minimize the 
chances of including an intron in the target region. In 
the overall design of STS primer sets, predicted product 
lengths were intentionally varied over the range 150 to 
1000 bp to allow for eventual multiplex screening of YAC 
libraries. 

Characterization of chromosome 11-specific STSs 



Predicted STS primers were tested by PCR 
amplification (Saiki et al., 1988, Science, 239:487-491) 
in a 30 µl reaction volume containing: 10 mM Tris (pH 
8.8), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 200 µM 
each dNTP, 100 ng template DNA and 1.5 units of Taq DNA 
polymerase. Initial PCR conditions were: 

denaturation at 93 °C for 2 minutes; 
35 cycles of 30 seconds each at 94 °C; 
annealing for one minute at the predicted 
annealing temperature, and 
30 seconds at 72 °C; followed by 
a final extension at 72 °C for five minutes. 

Initially, three different concentrations of primers 
(833, 500 and 250 nM) were tested for amplification using 
human genomic DNA. An optimal concentration and 
temperature were determined for the amplification of the 
following test DNA samples: human genomic DNA, DNA from 
the hybrid cell line J1 containing human chromosome 11 in 
a CHO cell background, hamster genomic DNA, mouse genomic 
DNA and yeast DNA. Those primer sets that produced an 
amplification product of the expected size from human 
genomic DNA and J1 DNA were characterized further. Most 
of the STSs generated have a narrow range of annealing 
temperatures from 56 °C to 60 °C and relatively short 
primer lengths of around 20 nucleotides. 

Physical Mapping by fluorescence in situ 
hybridization (FISH) 

To map the genomic location of each STS, in 
situ hybridization using cosmid DNA was carried out on 
metaphases prepared from human fibroblasts (CRL1634; 
Human Genetic Mutant Cell Repository, Camden, New Jersey) 
using well-known methods (see, e.g., Giovannini et al., 
1993, Cytogenet. Cell Genet., 63:62). Chromosomal 
localization was determined as chromosome 11 FLpter 
(fractional chromosomal length from 11pter) and 
cytogenetic band position was extrapolated from the 
coincidence of FLpter values to cytogenetic bands on the 
chromosome ideogram. 

Somatic cell hybrid analysis 

Additional localization of some cosmid 
sequences, and localization of STSs produced from gene 
sequences, were carried out using a panel of hybrid cell 
lines set forth in Table 5 containing part or all of 
human chromosome 11 in mouse or hamster genomic 
backgrounds. Analyses of the PCR amplification of STSs 
from DNA isolated from this set of hybrids mapped the 
STSs to eight distinct regions, or bins, on chromosome 11 
(Figure 10). 

The whole set of 370 STS markers, associated 
with known genes or chromosome-specific cosmid clones, 
was regionally mapped to chromosome 11. Some of the map 
locations of these cosmids had been reported previously, 
though not associated with STS identifiers of the present 
invention. Of the 370 STSs generated, 335 were 
successfully mapped to specific regions of human 
chromosome 11 (Table 6) using FISH or through the PCR 
analysis of a somatic cell mapping panel. Mapping 
information for some markers was available from GDB and 
was confirmed employing the techniques described herein 
based upon hybrid analysis. Consistent results were 
found in all cases when mapping information on many 
markers was determined using both FISH and hybrid 
analysis. 

In general, the set of standard STSs derived in 
this example appears to be uniformly distributed 
throughout the chromosome and includes markers for most of 
the significant mapping landmarks. Chromosome 11 was 
divided into twenty regions and the average distribution 
of STSs along chromosome 11 was assessed. The average 
number of STSs per region was 17 with a standard 
deviation of 4.5. The most over-represented region was 
FLpter 0.85 to 1 (11q23.3 - q25) while 0.3 to 0.45 (11p12 
- q12.1) and 0.6 to 0.85 (11q14.1 - q23.2) contained 
proportionally fewer STSs. The relative precision of the 
mapping information can be estimated from the 
distribution of FLpter ranges of the 335 mapped STSs. 
Twenty-four percent of the markers had an FLpter range of 
less than 0.05, 55% less than 0.1 and 83% less than 
0.2. 
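
The two summaries above (markers per region, and the fraction of markers localized to within a given FLpter range) can be computed as in the following Python sketch. It is illustrative only: the function names are hypothetical, the input values in the usage lines are invented, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, pstdev


def bin_counts(midpoints, n_bins=20):
    """Count markers per equal-width FLpter region (0 to 1 split into n_bins)."""
    counts = [0] * n_bins
    for m in midpoints:
        index = min(int(m * n_bins), n_bins - 1)  # FLpter 1.0 goes in the last bin
        counts[index] += 1
    return counts


def fraction_within(ranges, threshold):
    """Fraction of markers whose FLpter range (hi - lo) is below the threshold."""
    widths = [hi - lo for lo, hi in ranges]
    return sum(w < threshold for w in widths) / len(widths)


if __name__ == "__main__":
    # Hypothetical FLpter ranges for five markers.
    ranges = [(0.10, 0.12), (0.20, 0.50), (0.30, 0.38), (0.55, 0.65), (0.90, 1.00)]
    midpoints = [(lo + hi) / 2 for lo, hi in ranges]
    counts = bin_counts(midpoints)
    print(mean(counts), pstdev(counts))
    for threshold in (0.05, 0.1, 0.2):
        print(threshold, fraction_within(ranges, threshold))
```

Applied to the 335 mapped STSs, the first function would give the per-region counts behind the reported mean and standard deviation, and the second the cumulative precision figures.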
Data Analysis 

As described above, 371 DNA sequence 
fragments, determined at one-pass accuracy, comprising 
116 kb of chromosome 11 derived from cosmid ends were 
analyzed for the presence of repetitive sequences, simple 
sequence repeats and similarities to known genes. All of 
the sequence fragments were subjected to computer 
analyses for the presence of noteworthy sequence 
structures (Table 7). The presence of repetitive 
sequences was determined using the program FASTA and a 
repetitive sequence database (Jurka et al., 1992, J. Mol. 
Evol. 35, 286-291) supplemented with a comprehensive set 
of di- and tri-nucleotide repeats. A FASTA cutoff score 
of 100, determined visually, was used to distinguish 
repetitive sequences from background random matches. 
Similarities to known genes were identified with the 
program BLAST and the GenBank database. 

Amino acid comparisons were performed by 
translating DNA sequence fragments into all six potential 
reading frames and comparing translations to protein 
sequences in the Swiss-Prot, GenPept or PIR databases 
using the program BLASTX. Putative exons were identified 
using the program GRAIL on the Oak Ridge National 
Laboratory Internet server. The results of these various 
searches were evaluated numerically and by inspection. 
The data associated with this project, including DNA 
sequence file pointers, predicted STS primer sequences, 
test and mapping results, and in situ hybridization 
analysis, were stored in a relational database called 
Genome Notebook specifically designed for this project. 
DNA sequence and mapping information on other genes was 
imported into the Genome Notebook database from GDB and 
GenBank. 
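
The six-frame translation step that precedes the protein comparison can be sketched in Python as follows. This is a minimal illustration, not the BLASTX implementation: it assumes the standard genetic code, marks codons containing ambiguous bases as X, and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    forward = dna.replace(" ", "").upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translations would then be compared against the protein databases; a match in any frame flags the fragment as potentially coding.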






WO 94/29486 



PCT/US94/06810 






STS MAP OF CHROMOSOME 11 
TABLE 7 

Analysis of 371 random cosmid end sequences determined by automated fluorescent sequencing 

Category                           number    percentage    percentage 
  subcategory                                of total      of category 

Contain repetitive DNA               150        40% 
  Alu                                 60        16%          (40%) 
  LINE-1                              41        11%          (27%) 
  Middle element repetitive           17         5%          (11%) 
  CA repeats                           7         2%           (5%) 
  others                              27         7%          (18%) 

Grail predicted exons                 34         9% 
  Excellent                            6         2%          (18%) 
  Good                                 9         2%          (26%) 
  Marginal                            19         5%          (55%) 

Matches to protein sequences          29         8% 
  Certain                              7         2%          (24%) 
  Probable                             2         1%           (7%) 
  Likely                               2         1%           (7%) 
  Possible                             8         2%          (28%) 
  Marginal                            10         3%          (35%) 



Note: Repetitive sequence analysis was carried out using FASTA and a customized 
repetitive sequence database. Similarities to known proteins were determined using 
BLASTX. Putative exons were determined using GRAIL. The number of sequence 
fragments containing each element is shown for each category, and the subtypes of 
repetitive DNA, potential exons and sequence match quality are shown with their number 
and percentage of the total. The last column shows the percentage of sequences within 
each category. 
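
The two percentage columns of Table 7 are related to the counts by simple division: the first uses all 371 fragments as the denominator, the second uses the category total. The short Python sketch below, which is illustrative and uses a hypothetical helper name, reproduces a few of the tabulated values from the counts given in the table.

```python
TOTAL_FRAGMENTS = 371  # cosmid end sequences analyzed (Table 7)


def percent(count: int, total: int) -> int:
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * count / total)


if __name__ == "__main__":
    # Category totals from Table 7, as a percentage of all fragments.
    for name, count in [("repetitive DNA", 150),
                        ("GRAIL predicted exons", 34),
                        ("protein matches", 29)]:
        print(name, percent(count, TOTAL_FRAGMENTS))

    # Subcategory counts as a percentage of their category (150 repeat hits).
    for name, count in [("Alu", 60), ("LINE-1", 41)]:
        print(name, percent(count, 150))
```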
The results indicate that the average length of 
reliable sequence was 312 nucleotides with a standard 
deviation of 46 nucleotides. Repetitive DNA sequences of 
some type were found in 150 of these sequences, with Alu 
elements (40% of the repeat-containing sequences) and 
LINE-1 sequences (27%) being the most 
frequent. Middle element repetitive sequences (11%) and 
simple sequence (CA)n repeats (5%) were also detected 
with reasonable frequency. The neural net based program, 
GRAIL, was utilized to predict the locations of possible 
exons and detected putative genes in 34 sequences (9%); 
nearly half of these (15 of 34) were rated excellent or 
good according to 
reliability estimates used by this program. Analysis for 
additional possible gene sequences was carried out by 
computer searches for identity or similarity matches at 
the nucleotide and amino acid level. Significant matches 
to known protein sequences were detected for twenty-nine 
(8%) of the sequence fragments (see, e.g., Table 8). 
corresponds to 5% (Table 8; Figure 11). In comparison, 
the rate of gene identification from brain cDNA 
sequencing is about 14%. Strategies for sequencing cDNA 
libraries suffer from the problem of sequencing the same 
cDNA multiple times due to the differential abundance of 
mRNAs. As demonstrated here, random genomic sequencing 
is associated with a reasonable rate of gene 
identification (2-5%) coupled with direct gene mapping, 
and thus is considered to be an advantageous strategy for 
further characterization of cosmid and YAC clone maps. 
STSs prepared by the methods described herein will 
provide a useful reagent set for further physical 
mapping, for constructing YAC contigs using STS content 
mapping, and for further DNA sequence analysis. In 
addition, automated fluorescent sequencing of randomly 
chosen cosmid clones is a rapid and powerful tool for 
generating PCR detectable markers as well as defining the 
locations of putative genes. Theoretical analysis of the 
strategy of STS content mapping (Arratia et al., 1991, 
Genomics 11:806-827; Palazzolo et al., 1991, Proc. Natl. 
Acad. Sci. USA 88:8034-8038) suggests that this number 
of unique and uniformly distributed markers will serve as 
an appropriate starting point for future physical 
analysis. 

In summary, sequences were determined from the 
ends of chromosome 11 specific cosmids by automated 
sequencing without intermediate subcloning. The STSs and 
cosmids were mapped by in situ hybridization, somatic 
cell hybrid analysis or both. This effort generated 370 
STSs specific for human chromosome 11 and regionally 
mapped most of them. Sixty-eight percent of these STSs 
(251/370) were produced from new chromosome 11 sequences; 
18% (68/370) represented sequences derived from cloned 
genes; 8% (29/370) were based upon STS markers deposited 
and available from GDB. The latter were retested with 
our set of standard conditions to allow integration of 
this map with the results of other groups. 

While the invention has been described in 
detail with reference to certain preferred embodiments 
thereof, it will be understood that modifications and 
variations are within the spirit and scope of that which 
is described and claimed. 
Sequence Listing 

SEQ ID NO: 1 

GGTCAAGCTC AGCAACATGA 

SEQ ID NO: 2 

TGCTTTGTGA CCATCGAGAG 

SEQ ID NO: 3 

CAGCAGATGG TCAAGCAAAA 

SEQ ID NO: 4 

ACTCCTGACA CCACCACCTC 

SEQ ID NO: 5 

TCGCTCACTG ACTCGCTG 

SEQ ID NO: 6 

AGCCCTCCCG TATCGTAGTT 

SEQ ID NO: 7 

CTTGAGAGCC TTCAACCCAG 

SEQ ID NO: 8 

AACTGGGCGG AGTTAGGG 
That which is claimed is: 

1. A method for sequencing complex genomes, 
said method comprising: 

(1) sequencing the end-specific nucleotides of 
each member of a library of cosmid clones, 
wherein said cosmid clones are prepared by 
inserting genomic DNA fragments into 
cosmid vectors, and 
wherein the cosmid vectors include sequences of 
nucleotides that flank at least one end of 
the inserted DNA, and that serve as 
transcription initiation sites for the 
synthesis of nucleic acids specific to the 
ends of the inserted DNA, and 

(2) assembling a sequence sampled map by 
correlating the end-specific nucleotide sequence 
information with the relative spatial relationship 
between the cosmids. 

2. A method according to Claim 1 wherein the 
relative spatial relationship between the cosmids has 
been determined prior to said sequencing the end-specific 
nucleotides of each member of a library of cosmid clones. 

3. A method according to Claim 1 wherein the 
relative spatial relationship between the cosmids is 
determined by the cosmid multiplex analysis method. 

4. A method according to Claim 1 wherein the 
relative spatial relationship between the cosmids is 
determined by restriction-fragment-length mapping of the 
cosmids. 

5. A method according to Claim 1 wherein at 
least 100 base pairs of end-specific nucleotide sequences 
are determined. 
6. A method according to Claim 1 wherein said 
cosmid clones are generated in cosmid vectors allowing 
for the synthesis of end-specific nucleic acid sequences 
directly from at least one end of DNA fragments inserted 
therein. 

7. A method according to Claim 6 wherein said 
cosmid vectors comprise at least one promoter specific 
for a bacteriophage RNA polymerase and a cloning site 
allowing for the insertion of DNA fragments, said 
promoter being positioned operatively for transcription 
of a DNA fragment inserted into said cloning site. 

8. A method according to Claim 7 wherein said 
cosmid vectors comprise two oppositely oriented 
promoters, each of which is specific for a bacteriophage 
RNA polymerase, positioned on two sides of said cloning 
site, operatively for transcription of a DNA fragment 
inserted into said cloning site. 

9. A method according to Claim 8 wherein each 
of said bacteriophage RNA polymerase-specific promoters 
is selected from the group consisting of promoters 
specific for bacteriophage T7 RNA polymerase, and 
promoters specific for bacteriophage T3 RNA polymerase. 

10. A method according to Claim 9 wherein said 
cosmid vector is selected from the group consisting of 
pWE8, pWE10, pWE15, and pWE16. 

11. A method according to Claim 6 wherein said 
cosmid vectors comprise at least two cos sites. 

12. A method according to Claim 11 wherein 
said cos sites are separated by unique restriction sites. 
13. A method according to Claim 12 wherein 
said cosmid vector is selected from the group consisting 
of sCOS-1, sCOS-2, sCOS-4, and derivatives thereof. 

14. A method for sequencing complex genomes, 
said method comprising: 

(1) preparing a genomic library of cosmid 
clones by inserting DNA fragments from said genome into 
cosmid vectors, wherein the cosmid vectors include 
sequences of nucleotides that flank at least one end of 
the inserted DNA, and that serve as transcription 
initiation sites for the synthesis of end-specific 
probes, 

(2) arranging the cosmid clones, whereby each 
clone may be identified and replicas of said arrangement 
may be reproduced, 

(3) pooling portions of said cosmid clones and 
synthesizing pools of mixed end-specific probes from the 
DNA inserts that have been prepared from said pooled 
clones, wherein each pool contains fewer than all of the 
cosmid clones in the library, but all of the cosmid 
clones in the library are included in at least one pool, 

(4) hybridizing each pool of probes to a 
replica of said arranged cosmid clones and identifying 
the cosmid clones in each replica that hybridize to the 
probes, wherein said identified clones include the pooled 
cosmid clones and cosmid clones that contain DNA inserts 
that overlap with the DNA inserts in the pooled clones, 

(5) identifying the cosmid clones from among 
those identified in step (4) that hybridize to two or 
more pools of probes, thereby identifying groups of 
cosmid clones that include overlapping DNA, 

(6) assembling contigs from said groups, and 

(7) sequencing the fragment ends of the DNA 
inserts of each of the overlapping cosmid clones. 
15. A method according to Claim 14 wherein 
cross-hybridizing clones are identified by comparing the 
data sets obtained from two groups of cosmid clones 
containing at least one common clone, and repeating the 
pairwise comparison with other groups of clones 
containing at least one common clone. 

16. A method according to Claim 14 wherein 
said cosmid clones are pooled according to the rows and 
columns of a two-dimensional matrix, and said mixed 
end-specific RNA sequences are hybridized to a replica of 
the entire matrix. 

17. A method according to Claim 14 wherein 
said cosmid clones are pooled according to the planes 
intersecting with a three-dimensional matrix, and said 
mixed end-specific RNA sequences are hybridized to a 
replica of the entire matrix. 

18. A method according to Claim 14 wherein 
said cosmid clones are generated in cosmid vectors 
allowing for the synthesis of end-specific RNA sequences 
directly from at least one end of DNA fragments inserted 
therein. 

19. A method according to Claim 18 wherein 
said cosmid vectors comprise at least one promoter 
specific for a bacteriophage RNA polymerase and a cloning 
site allowing for the insertion of DNA fragments, said 
promoter being positioned operatively for transcription 
of a DNA fragment inserted into said cloning site. 

20. A method according to Claim 14 wherein 
said cosmid vectors comprise two oppositely oriented 
promoters, each of which is specific for a bacteriophage 
RNA polymerase, positioned on two sides of said cloning 
site, operatively for transcription of a DNA fragment 
inserted into said cloning site. 